Abstract: We review the current state of the art in the determination of the parton 
substructure of the nucleon, as expressed in terms of parton distribution functions (PDFs), 
and probed in high-energy lepton-hadron and hadron-hadron collisions, and we assess 
their implications for current precision collider phenomenology, in particular at the Large 
Hadron Collider (LHC). We review the theoretical foundations of PDF determination: the 
way cross sections are expressed in terms of PDFs using perturbative QCD factorization 
and evolution, the methodology used to extract PDFs from experimental data, and the 
way in which different physical processes can be used to constrain different PDFs. We 
summarize current knowledge of PDFs and the limitations in accuracy currently entailed 
for the computation of hadron collider processes, in particular at the LHC. We discuss the 
current main sources of theoretical and phenomenological uncertainties, and the direction 
of progress towards their reduction in the future. 
1 PARTON DISTRIBUTIONS IN PERTURBATIVE QCD 



Parton distribution functions (PDFs) encode the information on the substructure of hadrons, 
and in particular the nucleon, in terms of partons. Partons are quarks and gluons, the basic 
degrees of freedom of quantum chromodynamics (QCD), the theory of strong interactions, 
as probed in "hard" processes, i.e., high-energy processes which admit a perturbative de- 
scription. The physical, measurable cross section for any process with hadrons in the initial 
state is determined by folding PDFs with the perturbatively computable cross section that 
describes the interaction between partons. It follows that essentially no theoretical predic- 
tion at a hadron collider is possible without PDFs: indeed, the recent observation [1, 2] of a 
Higgs-like particle at the Large Hadron Collider (LHC) crucially relied on the knowledge of 
PDFs. The importance of an accurate knowledge of PDFs has thus enormously increased 
with the advent of the LHC as the frontier accelerator for the study of fundamental in- 
teractions. For example, PDF uncertainties are essential for precision determination of 
Standard Model parameters, and in making predictions for the production of possible new 
heavy particles in theories beyond the Standard Model. 

At the current stage of knowledge of strong interactions, parton distributions cannot 
be computed from first principles. They are instead determined by comparing the PDF- 
dependent prediction for one or more physical processes with its actual measured value: in 
this sense, their determination is akin to the problem of the measurement of fundamental 
constants, with the important complication that one has to determine a set of functions, 
rather than a set of numbers. 

The determination of PDFs has gone through various stages which mirror the evo- 
lution of the theoretical and phenomenological understanding of QCD. At a very early 
stage [3-9], parton distributions were determined through a combination of model assump- 
tions and the first experimental results on deep-inelastic electron-nucleon scattering (DIS). 
These determinations were semi-quantitative at best, and they were aimed at showing the 
compatibility of the data with the partonic interpretation of hard processes. 

As the accuracy of the data and confidence in perturbative QCD improved, a first 
determination of the gluon distribution was achieved [10]: this is nontrivial, because the 
gluon does not couple to leptonic probes, and it was thus determined indirectly, from 
the scale dependence of deep-inelastic structure functions. Soon thereafter, the first PDF 
sets [11, 12] were produced, based on consistent "global fits", i.e., the data-theory compar- 
ison for a set of different lepton-hadron and hadron-hadron scattering processes, chosen in 
order to maximize the information on PDFs. These analyses were all performed at leading 
order (LO), namely, using the lowest perturbative order in QCD calculations, which was 
accurate enough for these sets to be widely used for phenomenology in the ensuing decade, 
despite the early availability of next-to-leading order (NLO) tools [13]. 

However, thanks to the availability of high-precision deep-inelastic scattering and 
hadron collider data, the use of NLO theory soon became mandatory. Correspondingly, 
fairly wide sets of data of a varied nature were increasingly used as an input to the PDF 
determination, in order to minimize as much as possible the role of theoretical prejudice [14- 
17]. NLO parton sets evolved into standard analysis tools and were constantly updated 
throughout the ensuing decade. In particular, the wealth of deep-inelastic data from the 
HERA collider led to a considerable increase of both accuracy and kinematic coverage, and 
eventually led to global parton sets (such as CTEQ5 [18] and MRST2001 [19]) which could 
provide an input at an adequate level of accuracy to the NLO QCD computations used both 
in tests of the Standard Model and in searches for new physics. These sets of PDFs, while 
differing in many technical details, shared the basic underlying approach: a functional form 
for PDFs is assumed, parametrized by a relatively small number of parameters, that are 
determined by optimizing the fit of the computed observables to the experimental data. 
The PDF set of Reference [16] was produced along the same lines, but introducing the 
extra "dynamical" assumption that at a sufficiently low scale parton distributions become 
valence-like. 

Once PDFs became a tool for precision physics, an estimate of the uncertainty on 
their knowledge became mandatory. Previously, the only way of estimating the uncertainty 
related to the parton distribution was to compare results obtained with several parton sets: 
an especially unsatisfactory procedure given that many possible sources of systematic bias 
are likely to be common to several parton determinations. The first determinations of 
parton distributions with uncertainties were obtained by only fitting to restricted data sets 
(typically from a subset of deep-inelastic experiments), but retaining all the information 
on the correlated uncertainties in the underlying data, and propagating it through the 
fitting procedure [15, 20-22]. The need for a systematic approach to the determination of 
PDFs with reliable uncertainties was stressed in the seminal papers of References [23, 24], 
where an entirely different approach to parton determination was suggested, based on 
Bayesian inference. This approach was never fully implemented, but the need for PDFs 
with uncertainties based on global fits was generally recognized. 

The problem was tackled in References [25, 26], where it was shown that in order 
to obtain statistically meaningful results the conventional methodology had to be supplemented 
with an unorthodox treatment of uncertainties, in which the standard approach 
is modified by a suitable "tolerance" rescaling. Once this is done, it is possible to 
determine "error" PDF sets along with the central best-fit, which allow for a determination 
of a one-sigma contour in parameter space about the best fit. PDFs with uncertainties have 
become the standard ever since, and more refined versions of the tolerance method have 
been used in subsequent global fits from the MSTW [27] and CTEQ [28] groups. 

An alternative approach to PDF fitting was proposed in Reference [29], and eventually 
led the NNPDF collaboration to produce a first PDF set based on DIS data [30] and then 
a PDF set from a global fit [31]. This approach differs in two main respects from the 
standard one. The first is that PDFs are represented as a Monte Carlo sample, from which 
the central value and uncertainty can be computed respectively as a mean and standard 
deviation, rather than from a best-fit and error sets. The second is that the functional form 
used for the PDF parametrization, based on neural networks, has a very large number of 
parameters (more than 250 for the PDF sets of References [30, 31], to be compared to 
about 30 for sets based on a standard parametrization). Therefore, the best-fit is not 
determined as an absolute minimum of a figure of merit (such as the $\chi^2$), which, given 
the large number of parameters, would involve also fitting statistical noise, but rather by 
stopping the minimization before the noise starts being fitted, through a suitable criterion. 

PDF sets with uncertainties, based at least on NLO QCD theory, and relying on 
a global set of data, had thus become the standard by the late 2000s. However, the 
demands of precision phenomenology, specifically at the LHC, have led to several further 
theoretical and phenomenological improvements. Firstly, with the increasing availability 
of calculations to next-to-next-to-leading order (NNLO) in QCD, now all PDF sets have 
been extended to include also sets which use NNLO QCD theory in their determination. 
Furthermore, all sets now include heavy-quark mass effects. Finally, most sets are now 
available for a variety of values of the strong coupling. 

There are presently at least three sets of PDFs with all these features which are be- 
ing maintained and updated, from the CTEQ/CT, MSTW and NNPDF collaborations. 
Further PDFs based on smaller data sets have been produced recently: by the GJR/JR 
group [32, 33], following the "dynamical" approach of Reference [16]; by the ABKM/ABM 
group [34, 35], using mostly DIS data, following the approach of References [20, 36, 37]; 
and by the HERAPDF group, which only uses HERA DIS data [38-40]. 

Several benchmarking exercises involving various sets of PDFs have been performed re- 
cently [41-45], and in particular the benchmarking of Reference [42] has led to the so-called 
PDF4LHC recommendation [46] which suggested the use of an envelope of the CTEQ/CT, 
MSTW and NNPDF PDFs for the purposes of searches, calibration (e.g., acceptance com- 
putations) and precision tests of the Standard Model at the LHC, and in particular for 
Higgs searches [47]. 

The purpose of this review is twofold. On the one hand it aims to provide an accessible 
introduction to the theory and phenomenology of parton distributions. In this respect the 
current review is more concise and pedagogical than other recent reviews [48-50] of the 
same or related topics. On the other hand, it aims to review the current state of the art 
in PDF determination, to provide an assessment of their accuracy and of the main sources 
of systematic and theoretical uncertainty on them, and to discuss the impact they have 
on LHC phenomenology. In this respect, it provides a more concise and critical snapshot 
than other recent benchmark papers mentioned above. This review updates the previous 
one, over 20 years old and by now somewhat dated, of Reference [51]. 

The structure of the remainder of the review is the following. In Section 2 we will briefly 
review the theoretical framework which underlies the definition of PDFs, the methodology 
used in their determination, and the way in which individual data sets control different 
aspects of PDFs. In Section 3 we will summarize the current PDF fits, their main features, 
and provide detailed comparisons between them. In Section 4 we will discuss the impact of 
PDFs, and particularly their uncertainties, on LHC phenomenology, specifically by computing 
various precision LHC observables ("standard candles") with different PDF sets 
and comparing the results. Finally, we comment on future prospects in Section 5. 

2 PDF DETERMINATION 

A PDF determination involves first, obtaining a theoretical prediction for various processes 
(at some given perturbative accuracy), and then comparing this prediction to the data. 
The second step involves, in particular, a methodology in order to extract PDFs and their 
uncertainties from this comparison, and also, a choice of measurable processes in order to 
maximize the information on the various PDFs. 

2.1 Theoretical Framework 

The basic property which enables the perturbative computation of cross sections for pro- 
cesses with hadrons in the initial state is their factorization into a partonic cross section 
— computed in perturbation theory, using the quark and gluon degrees of freedom of the 
QCD Lagrangian, and independent of the incoming hadron — and parton distributions, 
which characterize the hadronic bound states, and are universal, i.e., do not depend on 
the specific process. Thanks to universality, it is possible to determine PDFs using the 
experimental information on a particular set of processes, and then use them to obtain 
predictions for different processes. Here we will review some basic results, while referring 
to Reference [52] for a textbook treatment, and to Reference [53] for detailed proofs of the 
underlying factorization theorems. 

2.1.1 Factorization for Hadroproduction 

The cross section for a generic hadroproduction process which depends on a single scale 
can be written in factorized form as 

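\begin{align}
\sigma_X(s,M_X^2) &= \sum_{a,b}\int_{x_{\min}}^1 dx_1\,dx_2\; f_{a/h_1}(x_1,M_X^2)\,f_{b/h_2}(x_2,M_X^2)\,\hat\sigma_{ab\to X}(x_1x_2s,M_X^2) \tag{2.1}\\
&= \sigma_0\sum_{a,b}\int_\tau^1\frac{dx_1}{x_1}\int_{\tau/x_1}^1\frac{dx_2}{x_2}\; f_{a/h_1}(x_1,M_X^2)\,f_{b/h_2}(x_2,M_X^2)\,C_{ab}\!\left(\frac{\tau}{x_1x_2},\alpha_s(M_X^2)\right) \nonumber\\
&= \sigma_0\sum_{a,b}\int_\tau^1\frac{dx}{x}\;\mathcal{L}_{ab}(x,M_X^2)\,C_{ab}\!\left(\frac{\tau}{x},\alpha_s(M_X^2)\right), \tag{2.2}
\end{align}

where $s$ is the center-of-mass energy of the hadronic collision, $f_{a/h_i}(x_i,M_X^2)$ is the 
distribution of partons of type $a$ in the $i$th incoming hadron, $\hat\sigma_{ab\to X}$ is the parton-level cross 
section for the production of the desired final state $X$, the minimum value of $x_i$ is $x_{\min} = \tau$, 

$$\tau \equiv \frac{M_X^2}{s} \tag{2.3}$$

is the scaling variable of the hadronic process, and in the last step leading to Equation 2.2 
we defined the parton luminosity 

$$\mathcal{L}_{ab}(x,M_X^2) \equiv \int_x^1\frac{dx_1}{x_1}\, f_{a/h_1}(x_1,M_X^2)\,f_{b/h_2}\!\left(\frac{x}{x_1},M_X^2\right). \tag{2.4}$$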

Equation 2.1 also holds for factorizable multi-scale processes (such as, say, Higgs production 
in W fusion), with $\sigma$ evaluated as a function of the incoming hadron momenta $p_1$ and $p_2$, 
and $\hat\sigma$ evaluated as a function of the incoming parton momenta $x_1p_1$ and $x_2p_2$. 
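
The parton luminosity introduced with Equation 2.2 is a one-dimensional integral, $\mathcal{L}_{ab}(\tau) = \int_\tau^1 \frac{dx}{x}\, f_a(x)\, f_b(\tau/x)$ (scale dependence suppressed), and can be evaluated numerically in a few lines. The following is only a minimal sketch: the function name `parton_luminosity`, the midpoint rule in $\ln x$, and the toy input distributions are assumptions made here for illustration and are not taken from any PDF fitting code.

```python
import math
from typing import Callable


def parton_luminosity(f_a: Callable[[float], float],
                      f_b: Callable[[float], float],
                      tau: float,
                      n_steps: int = 2000) -> float:
    """Approximate L_ab(tau) = int_tau^1 dx/x f_a(x) f_b(tau/x).

    Uses the midpoint rule in t = ln(x), for which dx/x = dt, so that the
    small-x region is sampled uniformly on a logarithmic scale.
    """
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    t_min = math.log(tau)          # lower limit, x = tau
    step = -t_min / n_steps        # upper limit is t = 0, i.e. x = 1
    total = 0.0
    for k in range(n_steps):
        x = math.exp(t_min + (k + 0.5) * step)
        total += f_a(x) * f_b(tau / x)
    return total * step


# Toy check (hypothetical input): for f_a(x) = f_b(x) = x the integrand is
# the constant tau, so the analytic result is -tau * ln(tau).
tau_example = 0.01
numeric = parton_luminosity(lambda x: x, lambda x: x, tau_example)
analytic = -tau_example * math.log(tau_example)
```

Integrating in $\ln x$ rather than in $x$ is a design choice that suits distributions which rise steeply at small $x$; a production code would use an adaptive quadrature and real PDF tables instead of the toy lambdas.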
The hard coefficient function $C_{ab}(z,\alpha_s(M_X^2))$ is a function of the scale $M_X^2$ and the 
dimensionless ratio of this scale to the center-of-mass energy $\hat s$ of the partonic subprocess: 

$$z = \frac{M_X^2}{\hat s} = \frac{\tau}{x_1x_2}, \tag{2.5}$$

where $\tau$ is given by Equation 2.3. A prefactor has been extracted, so that at leading 
perturbative order the coefficient function is either zero (for partons that do not couple to 
the given final state at leading order), or else just a Dirac delta: 

$$\hat\sigma_{ab\to X} = \sigma_0\,C_{ab}(z,\alpha_s(M_X^2)), \qquad C_{ab}(z,\alpha_s(M_X^2)) = c_{ab}\,\delta(1-z) + O(\alpha_s), \tag{2.6}$$

where the matrix $c_{ab}$ depends on the specific process. For example, for virtual photon 
(Drell-Yan) production, $c_{ab}$ is nonzero when $ab$ is a pair of a quark and an antiquark of the 
same flavor, and in this case $\sigma_0 = \frac{4}{9}\pi\alpha\frac{1}{s}$. Equation 2.5 then implies that at leading order 

$$\tau_{\rm LO} = x_1x_2. \tag{2.7}$$

The factorized result of Equation 2.1 generally holds both for inclusive cross sections 
and rapidity distributions. In the latter case, however, there is an extra kinematic con- 
straint which relates the hadronic and partonic kinematic variables. In particular, at 
leading order the rapidity Y of the final state is related to the momentum fractions of the 
two partons by 

$$Y_{\rm LO} = \frac{1}{2}\ln\frac{x_1}{x_2}, \tag{2.8}$$

which are thus completely determined by knowledge of $Y$ and $\tau$. 
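
The leading-order relations $\tau = x_1x_2$ and $Y = \frac{1}{2}\ln(x_1/x_2)$ can be inverted to give $x_1 = \sqrt{\tau}\,e^{Y}$ and $x_2 = \sqrt{\tau}\,e^{-Y}$. The short sketch below implements this inversion; the function name and the explicit range check are illustrative choices made here, not part of the original discussion.

```python
import math
from typing import Tuple


def lo_momentum_fractions(tau: float, rapidity: float) -> Tuple[float, float]:
    """Return (x1, x2) from tau = x1*x2 and Y = 0.5*ln(x1/x2) at leading order."""
    if not 0.0 < tau <= 1.0:
        raise ValueError("tau must satisfy 0 < tau <= 1")
    # Requiring x1 <= 1 and x2 <= 1 restricts the rapidity to |Y| <= -0.5*ln(tau).
    y_max = -0.5 * math.log(tau)
    if abs(rapidity) > y_max:
        raise ValueError("rapidity outside the kinematically allowed range")
    root_tau = math.sqrt(tau)
    return root_tau * math.exp(rapidity), root_tau * math.exp(-rapidity)


# Central production (Y = 0) probes both partons at x = sqrt(tau).
x1_central, x2_central = lo_momentum_fractions(0.01, 0.0)
# Forward production probes one large and one small momentum fraction.
x1_forward, x2_forward = lo_momentum_fractions(0.01, math.log(2.0))
```

As an illustration with assumed inputs, a final state of mass 125 GeV produced at a hadronic center-of-mass energy of 8 TeV has $\sqrt{\tau} = 125/8000 \approx 0.016$, so central production probes both PDFs at $x \approx 0.016$, while forward production probes one larger and one smaller value of $x$.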
2.1.2 Factorization for Electroproduction 

For electroproduction, specifically deep-inelastic scattering, Equation 2.1 is replaced by a 
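factorized expression for the structure functions $F_i(x,Q^2)$ that parametrize the inclusive 
deep-inelastic scattering cross section: 

$$\frac{d^2\sigma^{{\rm NC},\ell^\pm}}{dx\,dQ^2}(x,y,Q^2) = \frac{2\pi\alpha^2}{xQ^4}\left[Y_+F_2^{\rm NC}(x,Q^2) \mp Y_-\,xF_3^{\rm NC}(x,Q^2) - y^2F_L^{\rm NC}(x,Q^2)\right], \tag{2.9}$$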

for neutral-current charged-lepton ($\ell^\pm$) DIS, where the longitudinal structure function is 
defined as 

$$F_L(x,Q^2) = F_2(x,Q^2) - 2xF_1(x,Q^2), \tag{2.10}$$

and 

$$Y_\pm = 1 \pm (1-y)^2, \tag{2.11}$$

in terms of the electron momentum fraction 

$$y = \frac{p\cdot q}{p\cdot k} = \frac{Q^2}{xs}, \tag{2.12}$$

and $p$ and $k$ are respectively the incoming proton and lepton momenta, $q$ is the virtual 
photon momentum ($q^2 = -Q^2$), and in the last step, which holds neglecting the proton 
mass, $s$ is the center-of-mass energy of the lepton-proton collision. Similar expressions 
hold for charged-current scattering. 

The factorized expression for the structure functions is 

$$F_i(x,Q^2) = x\sum_a\int_x^1\frac{dz}{z}\,C_{i,a}\!\left(\frac{x}{z},\alpha_s(Q^2)\right)f_a(z,Q^2). \tag{2.13}$$

Here, in the argument of the structure function $x = \frac{Q^2}{2p\cdot q}$ is the standard Bjorken variable, 
the hard coefficient function $C_{i,a}$ is the structure function computed with an incoming 
parton, and $f_a(z,Q^2)$ is the distribution of the parton $a$ in the only incoming hadron. Also 
in this case at lowest $O(\alpha_s^0)$, the coefficient function $C_{i,a}$ is either zero (for incoming gluons) 
or a constant (an electroweak charge) times a Dirac delta. 
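
As a worked illustration of the lowest-order statement, consider $F_2$ under the simplifying assumption of pure photon exchange, so that the electroweak charge is the squared electric charge $e_q^2$ of the quark (this specific choice is an assumption made here for illustration). Inserting a coefficient function proportional to a Dirac delta into the factorized expression for the structure functions collapses the convolution:

```latex
\begin{align}
C^{(0)}_{2,q}\!\left(\tfrac{x}{z}\right) &= e_q^2\,\delta\!\left(1-\tfrac{x}{z}\right),
\qquad C^{(0)}_{2,g} = 0, \nonumber\\
F_2^{\rm LO}(x,Q^2) &= x\sum_q e_q^2\int_x^1\frac{dz}{z}\,
\delta\!\left(1-\tfrac{x}{z}\right)\left[q(z,Q^2)+\bar q(z,Q^2)\right] \nonumber\\
&= x\sum_q e_q^2\left[q(x,Q^2)+\bar q(x,Q^2)\right]. \nonumber
\end{align}
```

The last step uses the substitution $u = x/z$, for which $dz/z = -du/u$ and the delta function sets $u = 1$. At this order the structure function thus measures a charge-weighted sum of quark and antiquark distributions at momentum fraction equal to the Bjorken variable, and carries no direct information on the gluon.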

2.1.3 Perturbative Computations 

The factorized expressions in Equations 2.1 and 2.13 express the hadronic cross section in 
terms of PDFs at the same scale, $M_X^2$ or $Q^2$, at which the hadronic cross section is evaluated. 
However, PDFs at different scales are related by perturbative evolution equations, namely 
the integro-differential equations 

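\begin{align}
\frac{\partial}{\partial\ln Q^2}\begin{pmatrix}\Sigma(x,Q^2)\\ g(x,Q^2)\end{pmatrix} &= \int_x^1\frac{dy}{y}\begin{pmatrix}P^{\rm S}_{qq}\!\left(\frac{x}{y},\alpha_s(Q^2)\right) & 2n_fP^{\rm S}_{qg}\!\left(\frac{x}{y},\alpha_s(Q^2)\right)\\ P^{\rm S}_{gq}\!\left(\frac{x}{y},\alpha_s(Q^2)\right) & P^{\rm S}_{gg}\!\left(\frac{x}{y},\alpha_s(Q^2)\right)\end{pmatrix}\begin{pmatrix}\Sigma(y,Q^2)\\ g(y,Q^2)\end{pmatrix}, \nonumber\\
\frac{\partial}{\partial\ln Q^2}\,q^{\rm NS}_{ij}(x,Q^2) &= \int_x^1\frac{dy}{y}\,P^{\rm NS}_{ij}\!\left(\frac{x}{y},\alpha_s(Q^2)\right)q^{\rm NS}_{ij}(y,Q^2), \tag{2.14}
\end{align}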

where $g$ is the gluon distribution, $\Sigma$ denotes the singlet quark distribution defined as 

$$\Sigma(x,Q^2) \equiv \sum_{i=1}^{n_f}\left(q_i(x,Q^2) + \bar q_i(x,Q^2)\right), \tag{2.15}$$



and the nonsinglet quark distributions are defined as any linearly independent set of $2n_f - 1$ 
differences of quark and antiquark distributions, $q^{\rm NS}_{ij}(x,Q^2) = q_i(x,Q^2) - q_j(x,Q^2)$. The 
splitting functions $P_{ab}$ are perturbative series in $\alpha_s$ that start at order $\alpha_s$ at LO. 
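
The nonsinglet evolution equation becomes an ordinary differential equation for Mellin moments, $q_N = \int_0^1 dx\, x^{N-1} q^{\rm NS}(x)$, and at leading order it can be solved in closed form. The sketch below illustrates this under stated assumptions that are not spelled out in the text above: the normalization $P = \frac{\alpha_s}{2\pi}P^{(0)} + \ldots$, one-loop running with $\beta_0 = 11 - \frac{2}{3}n_f$ at fixed $n_f$, and the standard leading-order quark-quark anomalous dimension. All function names and numerical inputs are hypothetical.

```python
import math

C_F = 4.0 / 3.0  # quark colour factor in QCD


def gamma_qq_lo(n: int) -> float:
    """N-th Mellin moment of the LO quark-quark splitting function P_qq^(0)."""
    harmonic = sum(1.0 / k for k in range(1, n + 1))
    return C_F * (1.5 + 1.0 / (n * (n + 1)) - 2.0 * harmonic)


def beta0(n_f: int) -> float:
    """One-loop beta-function coefficient, d(alpha_s)/dlnQ^2 = -beta0*alpha_s^2/(4*pi)."""
    return 11.0 - 2.0 * n_f / 3.0


def alpha_s_lo(q2: float, alpha_ref: float, q2_ref: float, n_f: int) -> float:
    """One-loop running coupling at scale q2, given its value at q2_ref."""
    return alpha_ref / (1.0 + alpha_ref * beta0(n_f) / (4.0 * math.pi)
                        * math.log(q2 / q2_ref))


def evolve_moment(moment_ref: float, n: int, alpha_ref: float,
                  alpha_new: float, n_f: int) -> float:
    """LO solution: q_N(Q^2) = q_N(Q0^2) * (alpha_new/alpha_ref)^(-2*gamma_N/beta0)."""
    return moment_ref * (alpha_new / alpha_ref) ** (-2.0 * gamma_qq_lo(n) / beta0(n_f))


# Toy inputs (assumed): alpha_s = 0.3 at Q0^2 = 4 GeV^2, evolved to 10^4 GeV^2.
alpha_low = 0.3
alpha_high = alpha_s_lo(1.0e4, alpha_low, 4.0, n_f=5)
first_moment = evolve_moment(1.0, 1, alpha_low, alpha_high, n_f=5)
second_moment = evolve_moment(1.0, 2, alpha_low, alpha_high, n_f=5)
```

The first moment has vanishing anomalous dimension, so it does not evolve, which is the Mellin-space statement of quark number conservation. The second moment decreases with increasing scale, reflecting the loss of quark momentum through gluon radiation.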

There are some constraints on perturbative evolution due to conservation laws, which 
hold at all scales: specifically the conservation of baryon number 

$$\int_0^1 dx\,\left(q_i(x,Q^2) - \bar q_i(x,Q^2)\right) = n_i \qquad (n_u = 2,\ n_d = 1,\ n_{s,c,b,t} = 0), \tag{2.16}$$



and the conservation of total energy-momentum 

$$\int_0^1 dx\,x\left[\sum_{i=1}^{n_f}\left(q_i(x,Q^2) + \bar q_i(x,Q^2)\right) + g(x,Q^2)\right] = 1. \tag{2.17}$$
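
In practice, sum rules of this kind are often used to fix some of the normalizations of the input distributions rather than fitting them. The following sketch shows the idea for a deliberately oversimplified toy proton containing only up-valence, down-valence, and gluon distributions of the form $A\,x^{\alpha}(1-x)^{\beta}$; the shapes, exponents, and function names are hypothetical and chosen only so that the integrals reduce to Euler Beta functions.

```python
import math


def euler_beta(a: float, b: float) -> float:
    """Euler Beta function B(a, b) = int_0^1 t^(a-1) (1-t)^(b-1) dt."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)


def number_integral(norm: float, alpha: float, beta: float) -> float:
    """int_0^1 dx norm * x^alpha * (1-x)^beta, valid for alpha > -1."""
    return norm * euler_beta(alpha + 1.0, beta + 1.0)


def momentum_integral(norm: float, alpha: float, beta: float) -> float:
    """int_0^1 dx x * norm * x^alpha * (1-x)^beta, valid for alpha > -2."""
    return norm * euler_beta(alpha + 2.0, beta + 1.0)


def valence_norm(n_valence: float, alpha: float, beta: float) -> float:
    """Normalization fixed by the number (baryon-number) sum rule."""
    return n_valence / euler_beta(alpha + 1.0, beta + 1.0)


def gluon_norm(quark_momentum: float, alpha: float, beta: float) -> float:
    """Normalization fixed by requiring the total momentum fraction to be 1."""
    return (1.0 - quark_momentum) / euler_beta(alpha + 2.0, beta + 1.0)


# Hypothetical toy shapes (alpha, beta); the sea quarks are neglected here.
UV_SHAPE = (-0.5, 3.0)
DV_SHAPE = (-0.5, 4.0)
G_SHAPE = (-1.1, 5.0)

norm_uv = valence_norm(2.0, *UV_SHAPE)   # two valence up quarks
norm_dv = valence_norm(1.0, *DV_SHAPE)   # one valence down quark
quark_momentum = (momentum_integral(norm_uv, *UV_SHAPE)
                  + momentum_integral(norm_dv, *DV_SHAPE))
norm_g = gluon_norm(quark_momentum, *G_SHAPE)
total_momentum = quark_momentum + momentum_integral(norm_g, *G_SHAPE)
```

With these toy exponents the valence quarks carry roughly a third of the proton momentum and the gluon normalization absorbs the remainder. A realistic parametrization would include sea quarks and would impose the same constraints numerically.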



Combining the factorized expressions in Equations 2.1 and 2.13 with the solution to the 
evolution equations, physical observables can be written as the convolution of a prefactor, 
which contains both the coefficient functions and the kernel that solves the evolution equations, 
with PDFs defined at some reference scale. In all available determinations, PDFs 
are parametrized at a fixed reference scale $Q_0^2$, and the solution to the evolution equations 
is used to produce tables of PDFs as a function of $x$ and $Q^2$ that are delivered to users, 
for example, through the standard LHAPDF interface [54]. 

Several public codes for the solution of the evolution equations are available [55-57]: 
these codes, as well as most of the codes used internally by the various collaborations per- 
forming PDF determinations, are benchmarked against standard tables [58, 59], originally 
produced by using two of these codes [55, 56] to evolve a set of reference toy PDFs. 

Because PDFs are extracted from a particular set of processes, and then used to make 
predictions for other processes, the perturbative accuracy of the predictions is limited by 
the perturbative accuracy of the computation of the processes used in PDF determination. 
The accuracy of a perturbative QCD computation is fixed by the number of orders which 
are included in the computation of the coefficient functions in Equation 2.6 (and their 
deep-inelastic counterparts in Equation 2.13), and of the splitting functions Pab which 
enter the evolution equations. Leading order means that both are computed to the lowest 
nonvanishing order, so splitting functions to order $\alpha_s$ and coefficient functions to order 
$\alpha_s^0$ for deep-inelastic scattering or Drell-Yan production, to order $\alpha_s^2$ for Higgs production via 
gluon fusion, and so on. 

Currently, splitting functions are known up to NNLO [60, 61], and coefficient functions 
are known up to NNLO for several processes used for PDF determination, such as Drell-Yan 
rapidity distributions [62], though not yet for jet production (for DIS they are even known 
up to N$^3$LO [63, 64]). PDFs may thus be determined up to NNLO accuracy, although NLO 
PDFs are also important because several collider processes are only known up to NLO. LO 
PDFs can be useful in conjunction with Monte Carlo event generators [65, 66], 
and are thus often optimized for this purpose [67-69], for example, by introducing some 
modifications of the standard LO expressions which partly simulate the missing higher-order 
terms. 

2.1.4 Treatment of Heavy Quarks 

An important subtlety involves the treatment of heavy quarks. Indeed, decoupling arguments [70] 
imply that the contributions of heavier quark flavors to any process are power-suppressed 
at scales which are below the threshold for their production [71]. Therefore, 
whereas in principle the QCD Lagrangian contains six quark flavors, in practice only a 
smaller number of "active" flavors are included in loops, and thus in particular when 
determining the running of $\alpha_s$ and solving the evolution equations. When expressing predictions 
for processes at various disparate scales in terms of a single set of PDFs it is thus 
necessary to use a so-called variable-flavor number (VFN) scheme, whereby different num- 
bers of active flavors are adopted at different scales. Use of a fixed-flavor number (FFN) 
scheme only allows comparison with the data in a restricted range of scales. 

In all PDF sets currently in use, the PDFs for charm and heavier quarks are not 
independently parametrized. Rather, heavy-quark distributions are generated as the result 
of pair production from gluons, which, at higher perturbative orders, can in turn be radiated 
from quarks. Such terms may appear as contributions to the coefficient functions, or as 
a result of perturbative evolution. If one varies the scale $M_X^2$ or $Q^2$ at which PDFs are 
evaluated in the factorized expressions of Equations 2.1 or 2.13, they are reshuffled between 
the coefficient function and the solution to perturbative evolution equations. 

This entails a further complication because, in the vicinity of the threshold for heavy- 
quark production, the quark mass cannot be neglected. Whereas there is no difficulty in 
principle in including the full dependence on the heavy-quark masses in coefficient func- 
tions, the solution of evolution equations only generates terms which depend logarithmically 
on the heavy-quark mass itself. It is thus necessary to explicitly include terms suppressed by 
powers of the heavy-quark mass in the coefficient functions, while subtracting the logarith- 
mically enhanced, unsuppressed terms that are already generated by solving the evolution 
equations in order to avoid double counting. 

At present, there exist at least three different schemes to do so, all of which have been 
worked out up to NNLO, i.e., including the exact dependence on the heavy-quark mass up 
to $O(\alpha_s^2)$, and all of which are based on the renormalization scheme with explicit quark 
decoupling of Reference [71]: ACOT [72, 73], recently extended to NNLO [74], TR [75-77], 
and FONLL, originally proposed for hadronic processes [78] and more recently extended 
to DIS [79]. These schemes have been benchmarked in Reference [80] (except ACOT at 
NNLO, then not available), where they were shown to differ by subleading terms, which 
may not be entirely negligible at NLO in the vicinity of the quark threshold, but rapidly 
decrease at NNLO. Specifically, at NLO FONLL and ACOT coincide exactly while TR 
differs by $O(\alpha_s^2(m^2))$ terms; at NNLO FONLL and TR differ by $O(\alpha_s^3(m^2))$ terms [80], 
while FONLL and ACOT are expected [74] to differ by $O(\alpha_s^3(Q^2))$. There also exists a 
scheme (BMSN [81]) which enables the inclusion of a heavy flavor as active in the running 
of the coupling, while not including it among the active flavors when solving evolution 
equations. 

2.1.5 Electroweak Corrections 

So far we have only discussed higher order perturbative corrections in the strong interac- 
tions. However, any process which involves electroweak interactions, such as deep-inelastic 
scattering or Drell-Yan production, also receives higher-order corrections in the electro- 
magnetic or weak interactions. These will enter in both coefficient functions and evo- 
lution equations. Roughly, because at the electroweak scale the fine structure constant 
$\alpha \sim \alpha_s^2$, one expects NLO corrections in the electromagnetic interaction to become 
relevant when computing at the NNLO order of the strong interaction. Such corrections 
are not included in any of the most recent PDF determinations, though in particular the 
effect of QED corrections to evolution equations has been discussed in References [82, 83]. 

2.2 Fitting Methodology 

Parton distributions are determined by comparing factorized expressions of the form of 
Equations 2.1 and 2.13 with experimental data. A confidence interval in the space of PDFs 
is then determined by minimizing a suitable measure of goodness-of-fit. This is nontrivial, 
because it involves defining a probability measure on a space of functions [24]. Two main 
methodologies to represent such a probability measure are currently used: the Hessian 
approach and the Monte Carlo approach. In any case, the infinite-dimensional problem 
of representing a space of functions must be reduced to a finite-dimensional form to be 
manageable, and this is done by introducing a PDF parametrization, for which several 
choices are possible. 

2.2.1 Goodness-of-fit 

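Goodness-of-fit is measured by a $\chi^2$ function (see, e.g., Reference [84]) 

$$\chi^2 = \sum_{i=1}^{N_{\rm dat}}\sum_{i'=1}^{N_{\rm dat}}(D_i - T_i)\,\left(V^{-1}\right)_{ii'}\,(D_{i'} - T_{i'}). \tag{2.18}$$

Here, $D_i$ are the data points, $T_i$ are the theory predictions, and the experimental covariance 
matrix is 

$$V_{ii'} = \delta_{ii'}\left(\sigma_i^{\rm uncorr}\right)^2 + \sum_{k=1}^{N_{\rm corr}}\sigma^{\rm corr}_{k,i}\,\sigma^{\rm corr}_{k,i'}, \tag{2.19}$$

where $i = 1,\ldots,N_{\rm dat}$ labels the individual data points, generally affected by uncorrelated 
(statistical and systematic) uncertainties $\sigma_i^{\rm uncorr}$, and $k = 1,\ldots,N_{\rm corr}$ labels the sources of correlated 
systematic uncertainty $\sigma^{\rm corr}_{k,i}$. 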

Diagonal entries in the covariance matrix are simply the sum in quadrature of all 
correlated and uncorrelated uncertainties: if information on correlations is unavailable one 
may thus simply add correlated and uncorrelated uncertainties in quadrature. However, 
proper inclusion of correlations is necessary in order for the $\chi^2$ to provide a faithful measure 
of goodness-of-fit, as neglecting correlations leads to an overestimation of uncertainties that 
may be substantial. A typical situation where this may happen is when the correlated and 
uncorrelated uncertainties are comparable in size, and $N_{\rm dat} \gg N_{\rm corr}$. 
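
The construction of the covariance matrix and of the $\chi^2$ can be made concrete with a small example. The sketch below uses only the Python standard library; the function names and the two-point toy data set are assumptions introduced for illustration, and the linear system is solved by plain Gaussian elimination rather than by an optimized routine.

```python
from typing import List, Sequence


def covariance_matrix(uncorr: Sequence[float],
                      corr: Sequence[Sequence[float]]) -> List[List[float]]:
    """Build V_ii' = delta_ii' * uncorr_i^2 + sum_k corr[k][i] * corr[k][i']."""
    n = len(uncorr)
    cov = [[float(sum(src[i] * src[j] for src in corr)) for j in range(n)]
           for i in range(n)]
    for i in range(n):
        cov[i][i] += uncorr[i] ** 2
    return cov


def solve(matrix: Sequence[Sequence[float]], rhs: Sequence[float]) -> List[float]:
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(matrix[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[row][c] -= factor * aug[col][c]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][c] * x[c] for c in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def chi2(data: Sequence[float], theory: Sequence[float],
         cov: Sequence[Sequence[float]]) -> float:
    """chi^2 = (D - T)^T V^{-1} (D - T), without forming V^{-1} explicitly."""
    resid = [d - t for d, t in zip(data, theory)]
    weighted = solve(cov, resid)
    return sum(r * w for r, w in zip(resid, weighted))


# Toy data set: two points, unit uncorrelated errors, one fully correlated source.
cov_full = covariance_matrix([1.0, 1.0], [[1.0, 1.0]])
cov_diag = covariance_matrix([2.0 ** 0.5, 2.0 ** 0.5], [])  # errors added in quadrature
chi2_same_sign = chi2([1.0, 1.0], [0.0, 0.0], cov_full)
chi2_opposite_sign = chi2([1.0, -1.0], [0.0, 0.0], cov_full)
chi2_quadrature = chi2([1.0, -1.0], [0.0, 0.0], cov_diag)
```

In this toy case the covariance matrix is $\begin{pmatrix}2 & 1\\ 1 & 2\end{pmatrix}$: a coherent deviation of both points gives $\chi^2 = 2/3$, an opposite-sign deviation gives $\chi^2 = 2$, while adding all uncertainties in quadrature gives $\chi^2 = 1$ in both cases and therefore cannot distinguish them.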

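The $\chi^2$ of Equation 2.18 can be rewritten by introducing $N_{\rm corr}$ shift (or nuisance) 
parameters $r_k$ [25, 85]: 

$$\chi^2 = \sum_{i=1}^{N_{\rm dat}}\left(\frac{\hat D_i - T_i}{\sigma_i^{\rm uncorr}}\right)^2 + \sum_{k=1}^{N_{\rm corr}}r_k^2, \tag{2.20}$$

where 

$$\hat D_i \equiv D_i - \sum_{k=1}^{N_{\rm corr}}r_k\,\sigma^{\rm corr}_{k,i}. \tag{2.21}$$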

Minimizing the $\chi^2$ in Equation 2.20 with respect to the shift parameters gives back 
Equation 2.18, so that these two expressions are completely equivalent, and either can be 
used in practice. The advantage of the expression in Equation 2.21 is that it is possible to 
study the behavior of the shifts $r_k$ at the minimum: specifically, their distribution ought 
to be univariate Gaussian with mean zero. 
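
The equivalence between the shift-parameter form and the covariance-matrix form of the $\chi^2$ can be checked explicitly in the simplest case of a single correlated source ($N_{\rm corr} = 1$). In that case the optimal shift is $r = B/(1+A)$ with $A = \sum_i (\sigma^{\rm corr}_i/\sigma^{\rm uncorr}_i)^2$ and $B = \sum_i (D_i - T_i)\,\sigma^{\rm corr}_i/(\sigma^{\rm uncorr}_i)^2$, and the covariance form follows from the Sherman-Morrison formula for a rank-one update. The sketch below is illustrative only; the function names and toy numbers are assumptions.

```python
from typing import Sequence


def best_shift(resid: Sequence[float], uncorr: Sequence[float],
               corr: Sequence[float]) -> float:
    """Shift r that minimizes the nuisance-parameter chi^2 for one correlated source."""
    a = sum((c / u) ** 2 for c, u in zip(corr, uncorr))
    b = sum(d * c / u ** 2 for d, c, u in zip(resid, corr, uncorr))
    return b / (1.0 + a)


def chi2_shifted(resid: Sequence[float], uncorr: Sequence[float],
                 corr: Sequence[float], shift: float) -> float:
    """Nuisance-parameter form: sum_i ((resid_i - r*corr_i)/uncorr_i)^2 + r^2."""
    pulls = sum(((d - shift * c) / u) ** 2
                for d, c, u in zip(resid, corr, uncorr))
    return pulls + shift ** 2


def chi2_covariance(resid: Sequence[float], uncorr: Sequence[float],
                    corr: Sequence[float]) -> float:
    """Covariance form for V = diag(uncorr^2) + corr corr^T via Sherman-Morrison."""
    s = sum((d / u) ** 2 for d, u in zip(resid, uncorr))
    a = sum((c / u) ** 2 for c, u in zip(corr, uncorr))
    b = sum(d * c / u ** 2 for d, c, u in zip(resid, corr, uncorr))
    return s - b * b / (1.0 + a)


# Toy residuals D_i - T_i and uncertainties (hypothetical numbers).
residuals = [0.3, -1.2, 0.7]
sigma_uncorr = [0.5, 0.4, 0.6]
sigma_corr = [0.2, 0.1, 0.3]

r_min = best_shift(residuals, sigma_uncorr, sigma_corr)
chi2_at_minimum = chi2_shifted(residuals, sigma_uncorr, sigma_corr, r_min)
chi2_from_covariance = chi2_covariance(residuals, sigma_uncorr, sigma_corr)
```

The two numbers agree to rounding, and evaluating `chi2_shifted` slightly away from `r_min` gives a larger value. For several correlated sources the same check requires solving a small linear system for the shifts instead of a single division.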

There is a subtlety related to the possibility that some of the uncertainties may be 
multiplicative, which becomes relevant if the best fit is determined by minimizing the $\chi^2$ of 
Equation 2.18. An uncertainty is multiplicative if the size of the uncertainty is proportional 
to the measured value, as is the case, for example, for an overall normalization uncertainty. 
In such a case, it can be shown [86] that minimization of the $\chi^2$ in Equation 2.18 would lead 
to biased results. Various ways of dealing with this problem are discussed in Reference [87], 
and a recent summary of the approach adopted by various PDF fitting groups is in the 
appendix of Reference [45]. 

2.2.2 Parton Parametrization 

A set of PDFs is a set of functions, one for each parton entering the factorized expressions 
in Equations 2.1 and 2.13. Because PDFs at different scales are related by the evolution 
equations, the goal is to determine a set of functions for $0 < x < 1$ at some reference scale 
$Q_0^2$. 

There are in principle thirteen independent PDFs in a given hadron (six quarks and 
antiquarks and the gluon); however, in practice, charm and heavier quark PDFs in the nu- 
cleon are not independently determined in all current PDF sets, and are instead assumed 
only to be generated by QCD radiation. The (moderate) impact of introducing an indepen- 
dent (non-perturbative) charm PDF, so that charm does not vanish below the threshold 
for its radiation ("intrinsic" charm [88]) has been studied in References [27, 89]. While in 
the past some relations between PDFs (such as, for example, equality of the strange and 
antistrange PDFs) have been introduced by assumption, the standard for current precision 
studies is to have a set of seven independent PDFs. In practice, in many cases, it turns 
out to be convenient to express the six light quark PDFs as suitable linear combinations, 
like the singlet combination of Equation 2.15. 

Once a suitable set of basis PDFs has been chosen, all existing PDF determinations 
are based on choosing a parametrization of PDFs at the reference scale. A standard choice, 
adopted by most PDF fitting groups, is to assume that 

$$f_i(x,Q_0^2) = x^{\alpha_i}(1-x)^{\beta_i}\,g_i(x), \tag{2.22}$$

where $g_i(x)$ tends to a constant for both $x \to 0$ and $x \to 1$. This choice is motivated by the 
expectation that PDFs behave as a power of $x$ as $x \to 0$ due to Regge theory, and as a power 
of $(1-x)$ as $x \to 1$ due to quark counting rules (see, e.g., Reference [90] and references 
therein). Specific choices for the function $g_i(x)$ differ between groups. Common choices are 
a polynomial or the exponential of a polynomial in $x$ or $\sqrt{x}$, with more parameters used to 
describe PDFs for which more information is available (such as the gluon) in comparison 
to those (such as the strange PDF) that are poorly constrained by the data. Typical 
contemporary PDF sets based on this choice of functional form are parametrized by about 
20-30 parameters (see Section 3 for a detailed discussion). 
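
A parametrization of this type is straightforward to code. The sketch below assumes, purely for illustration, that $g_i(x)$ is a polynomial in $\sqrt{x}$, $g(x) = \sum_n c_n x^{n/2}$; the function names and the coefficient values are hypothetical and do not correspond to the form used by any particular group.

```python
import math
from typing import Sequence


def pdf_at_reference_scale(x: float, alpha: float, beta: float,
                           poly_coeffs: Sequence[float]) -> float:
    """Evaluate f(x) = x^alpha * (1-x)^beta * g(x), with g(x) = sum_n c_n * x^(n/2)."""
    if not 0.0 < x < 1.0:
        raise ValueError("x must lie strictly between 0 and 1")
    sqrt_x = math.sqrt(x)
    g = sum(c * sqrt_x ** n for n, c in enumerate(poly_coeffs))
    return x ** alpha * (1.0 - x) ** beta * g


def effective_exponent(x: float, alpha: float, beta: float,
                       poly_coeffs: Sequence[float], eps: float = 1e-3) -> float:
    """Numerical d ln f / d ln x by central differences (requires f > 0)."""
    upper = pdf_at_reference_scale(x * math.exp(eps), alpha, beta, poly_coeffs)
    lower = pdf_at_reference_scale(x * math.exp(-eps), alpha, beta, poly_coeffs)
    return (math.log(upper) - math.log(lower)) / (2.0 * eps)


# Hypothetical parameters: alpha = 0.5, beta = 2, g(x) = 1 + 2*sqrt(x) + 4*x.
value = pdf_at_reference_scale(0.25, 0.5, 2.0, [1.0, 2.0, 4.0])
small_x_slope = effective_exponent(1.0e-6, 0.5, 2.0, [1.0, 2.0, 4.0])
```

Because $g(x)$ tends to a constant at both endpoints, the logarithmic slope of the distribution approaches $\alpha$ at small $x$, which is what `effective_exponent` probes numerically. At intermediate $x$ the polynomial controls the shape, which is why the number of coefficients is usually increased for the better constrained distributions.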

An altogether different option is to parametrize PDFs with a general functional form 
which does not incorporate any theoretical prejudice. Two options that have been con- 
sidered recently are neural networks [29-31] and Chebyshev polynomials [91], though only 
in the former case has a full-fledged PDF set been constructed. In this context, neural 
networks just provide a convenient unbiased set of (nonlinear) basis functions. The neural 
networks used for PDF parametrization in References [29-31] are multilayer feed-forward 
neural networks, one for each PDF and all with a fixed architecture (and thus number of 
parameters), determined to be greatly redundant for the problem at hand. In this case, the 
number of free parameters is of order of 200-300. The intermediate option of supplement- 
ing a parametrization of the form of Equation 2.22 by a prefactor written as an expansion 
over Chebyshev polynomials in order to study potential parametrization bias has recently 
been explored in References [92, 93]. 

When unbiased PDF parametrizations are adopted, and specifically when the num- 
ber of free parameters is very large, the absolute minimum of the figure of merit is not 
necessarily the best fit, as this may correspond to a result that might reproduce random 
fluctuations in the data, or display oscillations which are unlikely to be present in the 
(unknown) true result — we will discuss briefly in Section 2.2.3 how this difficulty may 
be circumvented. This clearly shows that the determination of a set of functions from a 
finite set of data points is mathematically an ill-posed problem, and thus that the choice 
of a PDF parametrization is a necessity. Whereas in all PDF determinations one tries to 
minimize all sources of theoretical bias, a certain amount of theoretical prejudice is thus 
always necessary in order to get a definite answer. 

2.2.3 Representation of PDF Uncertainties 

There exist (at least) two commonly used ways of representing probability distributions 
in the space of PDFs. The first (so-called Hessian) option is based on the standard least- 
squares method [84]. This is the procedure that is most commonly adopted when using a 
parametrization with a relatively small number of parameters. It is based on the assump- 
tion that the probability distribution in the space of PDFs is a multi-Gaussian in parameter 
space. Given a set of experimental data points for a collection of processes that depend 
on PDFs through factorized expressions of the form of Equations 2.1 and 2.13, one first 
determines a most likely PDF as the best-fit PDF, in turn given by the set of parameters 
which minimizes the $\chi^2$ in Equation 2.18 for the data-theory comparison (the way in which
individual processes constrain specific PDFs is discussed in Section 2.3 below). 

Once the best fit has been determined, a confidence level (C.L.) about it is determined
by expanding the $\chi^2$ in parameter space about its minimum to lowest nontrivial order. The
desired confidence level is obtained as the volume in parameter space about the minimum
that corresponds to a fixed increase of the $\chi^2$. For Gaussian uncertainties, the 68% (or
one-sigma) confidence level corresponds to the volume enclosed by the $\chi^2 = \chi^2_{\min} + 1$ surface.
This is called the Hessian method, because the confidence level is entirely determined by
the covariance matrix in parameter space, which is the inverse of the (Hessian) matrix of
second derivatives of the $\chi^2$ with respect to the parameters, evaluated at the minimum.
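
The relation between the Hessian matrix, the covariance matrix and the unit increase of the $\chi^2$ can be checked on a toy problem. The sketch below fits a straight line to four invented data points with uncorrelated errors; the model, the data and all function names are assumptions for illustration. The Hessian is normalized here as one half of the matrix of second derivatives (a convention assumed for this sketch), so that its inverse is the covariance matrix and a displacement by $1/\sqrt{\lambda}$ along an eigenvector with eigenvalue $\lambda$ raises the $\chi^2$ by exactly one unit.

```python
import math


def chi2(params, xs, ys, sigmas):
    """Uncorrelated chi^2 for the straight-line model y = a + b*x."""
    a, b = params
    return sum(((y - a - b * x) / s) ** 2 for x, y, s in zip(xs, ys, sigmas))


def half_hessian(xs, sigmas):
    """H_ij = (1/2) d^2 chi2 / dp_i dp_j; exact here because chi2 is quadratic."""
    w = [1.0 / s ** 2 for s in sigmas]
    h00 = sum(w)
    h01 = sum(wi * x for wi, x in zip(w, xs))
    h11 = sum(wi * x * x for wi, x in zip(w, xs))
    return [[h00, h01], [h01, h11]]


def fit_line(xs, ys, sigmas):
    """Return the best-fit (a, b) and the covariance matrix H^{-1}."""
    h = half_hessian(xs, sigmas)
    w = [1.0 / s ** 2 for s in sigmas]
    g0 = sum(wi * y for wi, y in zip(w, ys))
    g1 = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
    det = h[0][0] * h[1][1] - h[0][1] ** 2
    cov = [[h[1][1] / det, -h[0][1] / det], [-h[0][1] / det, h[0][0] / det]]
    best = (cov[0][0] * g0 + cov[0][1] * g1, cov[1][0] * g0 + cov[1][1] * g1)
    return best, cov


def eigen_directions(h):
    """Eigenvalues and unit eigenvectors of a symmetric 2x2 matrix (h[0][1] != 0)."""
    tr = h[0][0] + h[1][1]
    det = h[0][0] * h[1][1] - h[0][1] ** 2
    disc = math.sqrt(tr * tr - 4.0 * det)
    out = []
    for lam in ((tr - disc) / 2.0, (tr + disc) / 2.0):
        vx, vy = h[0][1], lam - h[0][0]
        norm = math.hypot(vx, vy)
        out.append((lam, (vx / norm, vy / norm)))
    return out


# Invented toy data, for illustration only.
xs, ys, sigmas = [1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.9], [0.1] * 4
best, cov = fit_line(xs, ys, sigmas)
chi2_min = chi2(best, xs, ys, sigmas)
for lam, (vx, vy) in eigen_directions(half_hessian(xs, sigmas)):
    shifted = (best[0] + vx / math.sqrt(lam), best[1] + vy / math.sqrt(lam))
    print("delta chi2 along eigenvector:", chi2(shifted, xs, ys, sigmas) - chi2_min)
```

The two shifted parameter sets play the role of one-sigma eigenvector sets in this two-parameter toy.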

In practice, in actual PDF fits involving large numbers of experimental data points from
different experiments, it turns out that the textbook criterion of varying the $\chi^2$ by
$\Delta\chi^2 = 1$ in order to determine the one-sigma contour leads to unrealistic results. This
conclusion was arrived at [25, 94] by comparing the parameter values that provide the best fit
to each set of experimental data: it is found that these best-fit values fluctuate much more than
one would expect if $\Delta\chi^2 = 1$ did actually provide a 68% confidence level in parameter
space. We will come back to the explanation for this fact, but we note immediately that it
could be due to neglect or underestimate of one or more sources of data uncertainty. More
realistic results are thus obtained by assuming that the 68% confidence level is obtained by
letting $\Delta\chi^2 = T^2$, where $T$ is a "tolerance" parameter, in turn determined by studying the
distribution of best-fit parameter values among experiments, e.g., imposing that indeed 90%
of experiments approximately fall within the 90% confidence level. More refined methods
involve determining a different tolerance [27] along each Hessian eigenvector ("dynamical"
tolerance).

An obvious advantage of the Hessian method is that it allows for a compact repre- 
sentation and computation of PDF uncertainties, by simply providing eigenvectors of the 
Hessian matrix rescaled by their respective eigenvalues, i.e., in practice, PDF sets which 
correspond to the variation by a fixed amount (such as one-sigma, or 90% C.L.) along the 
direction of each eigenvector. PDF uncertainties on the PDFs themselves, or any observ- 
ables that depend on them, are then simply found by adding in quadrature the variation 
along each direction. So in a Hessian approach one delivers a central set of PDFs $S_0$, and
$N_{\rm par}$ one-sigma error sets $S_i$, corresponding to the variation of each eigenvector in turn.
The best-fit value of any quantity $F(S)$ which depends on the PDF set (such as a cross
section, or a PDF itself), and its one-sigma uncertainty, are respectively:

$$
F_0 = F(S_0), \qquad \sigma_F = \sqrt{\sum_{i=1}^{N_{\rm par}} \left[F(S_i) - F(S_0)\right]^2}. \tag{2.23}
$$



(In practice, a slightly more complicated formula is often used that gives asymmetric uncer- 
tainties). The price to pay for this (besides the need to use linearized error propagation) 
is that Hessian determination and diagonalization rapidly become unmanageable if the 
number of parameters is too large. 
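
A minimal sketch of the symmetric quadrature formula is given below. The function names and the numbers in the usage example are invented for illustration; the conversion from a 90% C.L. to a one-sigma uncertainty assumes Gaussian errors, and the asymmetric variant mentioned above is not covered.

```python
import math

Z_90 = 1.6449  # approximate two-sided 90% Gaussian quantile, in units of sigma


def hessian_uncertainty(f_central, f_eigen):
    """Symmetric uncertainty: quadrature sum of the shifts of each eigenvector set."""
    return math.sqrt(sum((f - f_central) ** 2 for f in f_eigen))


def rescale_90_to_68(sigma_90):
    """Convert a 90% C.L. uncertainty to one sigma, assuming Gaussian errors."""
    return sigma_90 / Z_90


# Hypothetical observable evaluated on a central set and two eigenvector sets.
sigma = hessian_uncertainty(10.0, [10.3, 9.6])
print("F0 = 10.0 +/-", sigma)
print("if the sets were 90% C.L. sets, one sigma would be", rescale_90_to_68(sigma))
```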

An alternative way of representing probability distributions in the space of PDFs is the 
Monte Carlo method, whereby the probability distribution of PDFs is given by assigning 
a Monte Carlo sample of PDF replicas, namely $N_{\rm rep}$ PDF sets $S^k$. Any feature of the
probability distribution can be determined from the Monte Carlo sample. So, the best-fit
value of any quantity $F(S)$ which depends on the PDF set (such as a cross section, or a
PDF itself) is now determined as its expected value, namely as the mean over the replica
sample:

$$
F_0 = \langle F(S) \rangle = \frac{1}{N_{\rm rep}} \sum_{k=1}^{N_{\rm rep}} F(S^k), \tag{2.24}
$$

while the one-sigma interval is now computed as a standard deviation

$$
\sigma_F = \sqrt{\frac{1}{N_{\rm rep}-1} \sum_{k=1}^{N_{\rm rep}} \left[F(S^k) - \langle F(S) \rangle\right]^2}. \tag{2.25}
$$



The obvious advantage of the Monte Carlo method is that it does not require assumptions 
to be made on the form of the probability distribution in parameter space, and also that 
it provides a direct representation of the probability distribution, which is convenient for 
many applications, as we shall see shortly. 
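
In code, the Monte Carlo estimates amount to a sample mean and a sample standard deviation over the replicas. The sketch below assumes the $N_{\rm rep}-1$ normalization of the variance and uses invented replica values; the function name is hypothetical.

```python
import math


def mc_mean_std(values):
    """Mean and sample standard deviation of an observable over PDF replicas."""
    n_rep = len(values)
    if n_rep < 2:
        raise ValueError("need at least two replicas")
    mean = sum(values) / n_rep
    variance = sum((v - mean) ** 2 for v in values) / (n_rep - 1)
    return mean, math.sqrt(variance)


# Hypothetical values of an observable computed on five replicas.
central, sigma = mc_mean_std([1.0, 2.0, 3.0, 4.0, 5.0])
print(central, sigma)
```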
There are various ways of constructing a Monte Carlo PDF replica sample. One
possibility [29-31] is to first construct a Monte Carlo representation of the starting data
sample. This means that, instead of giving a list of data points $D_i$ with the covariance
matrix $V_{ij}$ of Equation 2.19, one constructs a set of $N_{\rm rep}$ data replicas $D_i^{(k)}$, with
$i = 1, \ldots, N_{\rm dat}$ and $k = 1, \ldots, N_{\rm rep}$, that reproduce the probability distribution of the
data, i.e., such that the data points $D_i$ and the generic element of the covariance matrix can be
respectively found by computing the average or the covariance over the replica sample:

$$
\langle D_i \rangle = \frac{1}{N_{\rm rep}} \sum_{k=1}^{N_{\rm rep}} D_i^{(k)}, \tag{2.26}
$$

$$
{\rm cov}_{ij} = \frac{1}{N_{\rm rep}-1} \sum_{k=1}^{N_{\rm rep}} \left(D_i^{(k)} - \langle D_i \rangle\right)\left(D_j^{(k)} - \langle D_j \rangle\right). \tag{2.27}
$$

One may verify a posteriori that when $N_{\rm rep}$ is large enough, then $\langle D_i \rangle$ tends to the
experimental data points $D_i$, and ${\rm cov}_{ij}$ tends to the experimental covariance matrix $V_{ij}$ of
Equation 2.19. The Monte Carlo sample of PDFs is then determined by fitting a PDF set
to each data replica, which can be done as above by minimizing a suitable figure of
merit. The set of data replicas is thus mapped onto a set of PDF replicas.
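
The construction of data replicas can be sketched as follows, under the simplifying assumption that the data are distributed as a multi-Gaussian with the given covariance matrix (real analyses may treat some uncertainties, such as normalizations, differently). The two data points, their covariance matrix and all function names are invented for illustration.

```python
import math
import random


def cholesky(v):
    """Lower-triangular L with L L^T = v, for a symmetric positive-definite v."""
    n = len(v)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(v[i][i] - s)
            else:
                low[i][j] = (v[i][j] - s) / low[j][j]
    return low


def make_replicas(data, cov, n_rep, seed=0):
    """Generate n_rep Gaussian data replicas D^(k) = D + L z, z ~ N(0, 1)."""
    rng = random.Random(seed)
    low = cholesky(cov)
    n = len(data)
    replicas = []
    for _ in range(n_rep):
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        replicas.append(
            [data[i] + sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(n)]
        )
    return replicas


def replica_mean_cov(replicas):
    """Average and covariance over the replica sample."""
    n_rep, n = len(replicas), len(replicas[0])
    mean = [sum(r[i] for r in replicas) / n_rep for i in range(n)]
    cov = [
        [
            sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in replicas) / (n_rep - 1)
            for j in range(n)
        ]
        for i in range(n)
    ]
    return mean, cov


data = [1.0, 2.0]                      # hypothetical central values
v_exp = [[0.04, 0.01], [0.01, 0.09]]   # hypothetical covariance matrix
mean, cov = replica_mean_cov(make_replicas(data, v_exp, n_rep=2000))
print(mean, cov)
```

Increasing `n_rep` makes the replica average and covariance approach the input values, which is the a posteriori check described above.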

This procedure is especially advantageous if PDFs are parametrized with a very large 
number of parameters so that a reliable determination of the Hessian matrix is impractical 
or impossible, because it only requires the determination of a best-fit PDF set for each 
data replica, without full knowledge of the Hessian. 

However, even the determination of this best fit may be nontrivial if the number of
parameters is very large: in this case, as mentioned, false minima and spurious fluctuations
may arise. Two methods to avoid this have been considered in the literature. One is the
cross-validation method [95]: the data are randomly divided into two sets ("training" and
"validation"); the $\chi^2$ is then computed for both sets separately, but only the $\chi^2$ of the
training set is minimized. Initially both the training and validation $\chi^2$ decrease, but at
some point the training $\chi^2$ keeps decreasing while the $\chi^2$ of the data in the validation
set starts increasing. The point at which this happens defines the best fit. Picking a
different partition of the data into training and validation sets for each replica ensures
that there is no information loss, though of course this is only true in the limit of a large
number of replicas. The other method consists of adding to the $\chi^2$ a penalty term that
disfavors functional forms which are too complex: an option that has been considered in
the literature [91] is to penalize PDFs which are longer with respect to a suitable metric
(and thus fluctuate too much). This procedure is perhaps more efficient, but it entails some
subjectivity in the choice of metric. At present, only the cross-validation method has been
implemented in a full-fledged PDF determination [30, 96, 97].
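
The logic of cross-validation stopping can be sketched independently of the fit itself. In the example below the minimizer is not modelled: the validation $\chi^2$ history is an invented list, the stopping point is taken to be the iteration at which the validation $\chi^2$ is lowest (one simple reading of the criterion above), and the function names are hypothetical.

```python
import random


def split_training_validation(n_data, fraction, rng):
    """Randomly partition data indices into training and validation sets."""
    indices = list(range(n_data))
    rng.shuffle(indices)
    n_train = int(round(fraction * n_data))
    return sorted(indices[:n_train]), sorted(indices[n_train:])


def stopping_point(validation_chi2):
    """Iteration with the lowest validation chi^2 (first one in case of ties)."""
    best = 0
    for iteration, value in enumerate(validation_chi2):
        if value < validation_chi2[best]:
            best = iteration
    return best


# A different partition would be drawn for each replica.
rng = random.Random(1)
training, validation = split_training_validation(10, 0.5, rng)
print(training, validation)

# Invented history: the validation chi^2 falls, then rises as the fit
# starts to reproduce fluctuations of the training data.
print(stopping_point([5.0, 3.0, 2.0, 2.5, 3.0]))
```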

Alternatively, one may view the Monte Carlo method as a different way of delivering 
results that have already been obtained by means of the Hessian method. In this case, 
one may construct [98] the PDF replicas $S^k$ by generating a multi-Gaussian distribution of
parameter values, centered at the best fit and with width provided by the Hessian matrix
itself, which is easily done by choosing a basis in which the Hessian matrix is diagonal, i.e.,

$$
F(S^k) = F(S_0) + \sum_{j=1}^{N_{\rm par}} \left[F(S_j) - F(S_0)\right] R_j^k, \tag{2.28}
$$

where $R_j^k$ is a random number taken from a univariate Gaussian distribution with mean
zero and unit variance, and $S_0$ and $S_j$ are the usual best-fit and eigenvector PDF sets. In this
case, it is possible to verify a posteriori that the number of Monte Carlo PDF replicas $N_{\rm rep}$ is
large enough that the original central value and Hessian covariance matrix are reproduced. In
practice it turns out that $N_{\rm rep} \sim 50$-$100$ replicas are necessary and sufficient to either
reproduce the input data set of a typical present-day global fit [30, 96, 97], or its output
Hessian PDFs [98].
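
The conversion of a Hessian set into Monte Carlo replicas can be sketched for a single observable as follows. The sketch assumes symmetric one-sigma eigenvector sets and unit-variance Gaussian random numbers; the numerical values and function names are invented for illustration.

```python
import math
import random


def hessian_to_replicas(f_central, f_eigen, n_rep, seed=0):
    """Replica values F(S^k) = F(S_0) + sum_j [F(S_j) - F(S_0)] * R_j^k."""
    rng = random.Random(seed)
    return [
        f_central + sum((f - f_central) * rng.gauss(0.0, 1.0) for f in f_eigen)
        for _ in range(n_rep)
    ]


def replica_mean_std(values):
    """Mean and sample standard deviation over the replicas."""
    n_rep = len(values)
    mean = sum(values) / n_rep
    return mean, math.sqrt(sum((v - mean) ** 2 for v in values) / (n_rep - 1))


# Hypothetical observable: central value 10.0, two eigenvector sets.
replicas = hessian_to_replicas(10.0, [10.3, 9.6], n_rep=100)
print(replica_mean_std(replicas))  # to be compared with 10.0 and 0.5
```

With a modest number of replicas the mean and standard deviation fluctuate around the Hessian central value and quadrature uncertainty; the agreement improves as `n_rep` grows.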

There are several reasons why it is useful to construct a Monte Carlo representation of 
a PDF set, even if it has been determined in a Hessian approach. One reason is that once 
a Monte Carlo representation of a given PDF set is available, new data can be included 
without performing a new fit, through Bayesian reweighting, whereby the original Monte 
Carlo replicas are supplemented by a weight which takes into account the effect of the new 
data. The correct implementation of this technique, originally suggested in Reference [23], 
was worked out in References [99, 100], while in Reference [98] it was shown how it can 
be implemented in a Hessian fit: its only limitation is that as new data are added, the 
number of starting replicas should be increasingly large for the accuracy of the Monte Carlo 
prediction to be preserved. On the other hand, it was pointed out in Reference [24] that if 
the number of new data included in this way is increasingly large, then the dependence on 
the original PDF set and parametrization becomes increasingly weak, and thus all issues 
related to the choice of PDF parametrization (such as potential bias) become increasingly 
less relevant. 
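
A sketch of how replica weights might be computed is given below. The weight formula is not stated in the text above: the code assumes the form $w_k \propto (\chi^2_k)^{(n-1)/2} e^{-\chi^2_k/2}$, with $n$ the number of new data points and $\chi^2_k$ the agreement of replica $k$ with the new data, normalized so that the weights sum to $N_{\rm rep}$, together with a Shannon-entropy estimate of the effective number of replicas. Both should be checked against the reweighting literature cited above before use; the function names and numbers are invented.

```python
import math


def reweight(chi2_new, n_new):
    """Weights w_k ~ chi2_k**((n-1)/2) * exp(-chi2_k/2), normalized to sum to N_rep.

    Assumed form of the weights; requires chi2_k > 0 for every replica.
    """
    logs = [0.5 * (n_new - 1) * math.log(c) - 0.5 * c for c in chi2_new]
    shift = max(logs)  # subtract the maximum for numerical stability
    raw = [math.exp(value - shift) for value in logs]
    norm = sum(raw) / len(raw)
    return [r / norm for r in raw]


def weighted_mean(values, weights):
    """Reweighted expectation value of an observable."""
    return sum(w * v for w, v in zip(weights, values)) / len(values)


def effective_replicas(weights):
    """Shannon-entropy estimate of the number of replicas left after reweighting."""
    n_rep = len(weights)
    entropy = sum(w * math.log(n_rep / w) for w in weights if w > 0.0)
    return math.exp(entropy / n_rep)


# Two hypothetical replicas: the first describes the new data much better.
weights = reweight([2.0, 20.0], n_new=3)
print(weights, effective_replicas(weights))
print(weighted_mean([1.0, 3.0], weights))
```

A small effective number of replicas signals the limitation noted above: the starting sample must be large enough for the reweighted prediction to remain accurate.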

Another reason why a Monte Carlo representation is useful is that it provides a simple 
way of combining results obtained by different groups. If, in particular, different groups 
arrive at independent PDF determinations using the same (or almost the same) data and 
theory, with differences only being due to either methodological choices, or theoretical 
differences which are beyond the accuracy of the calculation (such as, for example, different 
NNLO terms in an NLO computation) there is a priori no way of deciding which group
provides the most reliable determination. An effective way of combining results, while
taking into account the possibility of methodological differences, and thus arriving at a
more reliable result, is to simply produce a Monte Carlo set in which an equal fraction of 
replicas comes from each of the various groups [48, 98], as we shall see more explicitly in 
Section 4.1.3 below. This statistical combination is meaningful even, or especially, if the 
PDF sets from the different groups are strongly correlated. Moreover, the fraction taken 
from each group need not be equal if it is desired to include the PDF set from one or more 
groups with a different weight from the others. 
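
The combination can be sketched as a simple pooling of replicas. In the example below each group's replicas are represented by the value of a single observable, the group labels are not modelled, and the function name and numbers are invented; rounding may make the pooled sample differ slightly in size from the requested total.

```python
def combine_replicas(sets, n_total, fractions=None):
    """Pool replicas from several groups, taking a given fraction from each.

    sets: list of replica lists, one per group.
    fractions: relative weight of each group (equal by default).
    """
    if fractions is None:
        fractions = [1.0] * len(sets)
    total = sum(fractions)
    combined = []
    for replicas, fraction in zip(sets, fractions):
        n_take = int(round(n_total * fraction / total))
        if n_take > len(replicas):
            raise ValueError("not enough replicas available for this group")
        combined.extend(replicas[:n_take])
    return combined


# Three hypothetical groups whose replicas give 1.0, 2.0 and 3.0 for an observable.
groups = [[1.0] * 4, [2.0] * 4, [3.0] * 4]
equal = combine_replicas(groups, n_total=6)
print(equal, sum(equal) / len(equal))
print(combine_replicas(groups, n_total=4, fractions=[2.0, 1.0, 1.0]))
```

The mean and standard deviation of the pooled sample then give the combined central value and uncertainty, with the spread between groups contributing to the latter.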

2.2.4 PDF Uncertainties: the State of the Art 

The availability of several distinct methodologies for PDF parametrization and determina- 
tion allows for a comparison of results. Detailed comparisons will be presented in Section 3 
below, but we note immediately that results obtained with parametrizations of the form of
Equation 2.22 and a Hessian method with a tolerance criterion to determine uncertainty 
bands are generally in reasonably good agreement with those obtained using very gen- 
eral parametrizations and a Monte Carlo method with a stopping criterion to determine 
replica best fits. Also, both seem in good agreement with results obtained using Bayesian 
reweighting, which do not depend on a fitting procedure. This provides good evidence that 
current estimates of PDF uncertainties are not too far off the mark. 

However, it would be interesting to understand in more detail how the statistical 
features of the underlying data propagate onto the statistical features of PDF sets. In 
particular, it would be interesting to understand the detailed reasons for the need to intro- 
duce tolerance in the Hessian procedure, and how it relates to the distribution of best-fit 
replicas when the Monte Carlo method is used in conjunction with a very general PDF 
parametrization. Note that, in principle, tolerance could be introduced also in the Monte 
Carlo approach simply by rescaling experimental uncertainties during the generation of 
data replicas. 

It is clear that (at least) two different reasons may explain the need for tolerance. 
One is the presence of data inconsistencies, or equivalently, neglected or underestimated 
sources of uncertainty in PDF fits based on very broad data sets. The second is the fact 
that the choice of PDF parametrization is restricting the space of accessible PDFs. Various 
investigations of the relative importance of these effects have been presented: based on the 
Hessian approach with a standard [101] or extended Chebyshev [92, 93] parametrization, or 
on the Monte Carlo approach with a standard [98] or neural network [102] parametrization. 
These studies shed light on the relevant issues, but no consensus has yet emerged on the 
relative impact of these effects on tolerance and on the precise relation between tolerance 
and the way in which cross-validated fits explore the space of PDF minima. While the 
current accuracy in the determination of PDF uncertainties is most likely acceptable if 
compared to other sources of uncertainty in the computation of collider processes, the 
needs for greater accuracy which come from the availability of higher-order computations, 
and the hope of discovering new physics effects in small deviations between the data and 
current predictions, will require a deeper level of understanding of these issues. 

PDF uncertainties, as we have discussed them so far, are those that follow from prop- 
agation of the uncertainty of the experimental data that underlie the PDF determination. 
As the accuracy increases, however, other sources of uncertainties, and in particular all un- 
certainties related to the theory used in PDF determination, become relevant. At present, 
the only way of dealing with such uncertainties is to make sure that they are small enough. 
For example, the impact of different choices of treatment of heavy-quark masses was stud- 
ied in References [80, 103]. The impact of higher order corrections is studied by comparing 
NLO to NNLO PDFs. Higher-twist (power-suppressed) corrections are kept under control
by removing data below some low cutoff scale that may be affected by them, and their 
impact can be studied by varying this cutoff [104]. Nuclear corrections, that affect some 
deep-inelastic scattering data in which targets are deuterons or heavier nuclei, rather than 
just protons, have been studied by including such corrections according to various mod- 
els [27, 105], or by attempting to fit the corrections directly [27, 93]. In the future, a more 
systematic approach to each of these sources of theoretical uncertainty may be desirable,
and in particular it may become necessary to provide PDF sets with an estimate of the 
theoretical PDF uncertainty. 

2.3 Data Constraints Before the LHC 

We have seen that a typical PDF set includes seven different PDFs. This means that 
at least seven independent physical processes for given kinematics are needed in order to 
determine all the PDFs. In principle, this could be done by using deep-inelastic scattering 
alone, though in practice it is convenient to use a broad combination of data from both 
electro- and hadroproduction in order to obtain accurate results. Here we will discuss 
how pre-LHC data can be used to determine PDFs, while the impact of LHC data will be 
discussed in Section 4 below. 

A primary role is played by DIS and Drell-Yan data. To see this, note that the 
factorized expressions in Equations 2.1 and 2.13 immediately imply that at leading order 
deep-inelastic structure functions and Drell-Yan rapidity distributions provide a direct 
handle on individual quark and antiquark PDFs (DIS), or pairs of PDFs (Drell-Yan). It 
is thus possible to understand what is dominantly measured by each individual process by 
looking at the leading order expressions. 

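The leading order contributions to the DIS structure functions $F_1$ and $F_3$ are (at
leading order $F_2 = 2xF_1$):

$$
\begin{aligned}
\text{NC} \quad & F_1^{\gamma,Z} = \sum_i B_i \left(q_i + \bar q_i\right), \\
\text{NC} \quad & F_3^{\gamma,Z} = \sum_i D_i \left(q_i + \bar q_i\right), \\
\text{CC} \quad & F_1^{W^+} = \bar u + d + s + \bar c, \\
\text{CC} \quad & -F_3^{W^+}/2 = \bar u - d - s + \bar c,
\end{aligned} \tag{2.29}
$$

where NC and CC denote neutral- and charged-current scattering, and the contributions
coming from $Z$ exchange and from $\gamma Z$ interference have couplings

$$
\begin{aligned}
B_q(M^2) &= -2 e_q V_\ell V_q P_Z + \left(V_\ell^2 + A_\ell^2\right)\left(V_q^2 + A_q^2\right) P_Z^2, \\
D_q(M^2) &= -2 e_q A_\ell A_q P_Z + 4 V_\ell A_\ell V_q A_q P_Z^2,
\end{aligned} \tag{2.30}
$$

where the electroweak couplings of quarks and leptons can be found, e.g., in Reference [48]
and $P_Z = M^2/(M^2 + M_Z^2)$. The leading order contribution to Drell-Yan production is

$$
\begin{aligned}
\gamma: \quad & \frac{d\sigma}{dM^2\,dy}(M^2, y) = \frac{4\pi\alpha^2}{9 M^2 s} \sum_i e_i^2\, L^{ii}(x_1, x_2), \\
W: \quad & \frac{d\sigma}{dy} = \frac{\pi G_F M_W^2 \sqrt{2}}{3 s} \sum_{i,j} |V_{ij}|^2\, L^{ij}(x_1, x_2),
\end{aligned} \tag{2.31}
$$

in terms of the differential leading order parton luminosity

$$
L^{ij}(x_1, x_2) = q_i(x_1, M^2)\, \bar q_j(x_2, M^2) + q_i(x_2, M^2)\, \bar q_j(x_1, M^2) \tag{2.32}
$$

and the CKM matrix elements $V_{ij}$.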
It follows that a determination of deep-inelastic structure functions $F_1$ and $F_3$ for
charged-current deep-inelastic scattering provides four independent linear combinations
of quark distributions (if $W^+$ and $W^-$ exchange can be distinguished), with two more linear combinations
provided by neutral-current structure functions: all individual light quark and antiquark 
flavors can then be determined by linear combination. This situation would be realistic at 
a neutrino factory with both neutrino and antineutrino beams and the possibility of iden- 
tifying the charge of the final state lepton on an event-by-event basis [106]. Unfortunately, 
this theoretically and phenomenologically very clean option is at best far in the future, so 
at present the information on individual PDFs can only be achieved by combining informa- 
tion from different processes, each of which provides independent information, as we shall 
now discuss. 

2.3.1 Isospin Singlet and Triplet 

In neutral-current deep-inelastic scattering, only the charge-conjugation even combination
$q_i + \bar q_i$ can be determined. Specifically, photon DIS data only determine the fixed
combination in which each flavor is weighted by the square of the electric charge, see Equation 2.29.
However, an independent combination may be accessed by also measuring DIS on a neutron
(in practice deuterium) target, and using isospin symmetry to relate the quark and
antiquark distributions of the proton and neutron:

$$
u^p(x,Q^2) = d^n(x,Q^2), \qquad d^p(x,Q^2) = u^n(x,Q^2). \tag{2.33}
$$

One then has

$$
F_2^p(x,Q^2) - F_2^n(x,Q^2) = \frac{x}{3}\left[\left(u^p + \bar u^p\right) - \left(d^p + \bar d^p\right)\right]\left[1 + O(\alpha_s)\right], \tag{2.34}
$$

so that the difference of proton and neutron structure functions provides a leading-order
handle on the isospin triplet combination

$$
T_3(x,Q^2) = u(x,Q^2) + \bar u(x,Q^2) - \left[d(x,Q^2) + \bar d(x,Q^2)\right]. \tag{2.35}
$$
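
The relation between the proton-neutron difference and the triplet combination follows from a short leading-order calculation. The steps below are a sketch which assumes pure photon exchange, the standard normalization $F_2 = x\sum_q e_q^2 (q + \bar q)$, isospin symmetry for both quarks and antiquarks, and strange and heavier distributions (the dots) that are identical in the proton and the neutron:

```latex
\begin{aligned}
F_2^p &= x\left[\tfrac{4}{9}\left(u^p+\bar u^p\right) + \tfrac{1}{9}\left(d^p+\bar d^p\right) + \dots\right],\\
F_2^n &= x\left[\tfrac{4}{9}\left(u^n+\bar u^n\right) + \tfrac{1}{9}\left(d^n+\bar d^n\right) + \dots\right]
       = x\left[\tfrac{4}{9}\left(d^p+\bar d^p\right) + \tfrac{1}{9}\left(u^p+\bar u^p\right) + \dots\right],\\
F_2^p - F_2^n &= x\left(\tfrac{4}{9}-\tfrac{1}{9}\right)\left[\left(u^p+\bar u^p\right)-\left(d^p+\bar d^p\right)\right]
       = \tfrac{x}{3}\,T_3 .
\end{aligned}
```

The flavor-symmetric terms cancel in the difference, which is why this combination isolates the triplet at leading order.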

2.3.2 Light Quarks and Antiquarks 

DIS data from HERA are available both for NC and CC scattering, both with electron
and positron beams. Unfortunately, collider data only provide a fixed combination of the
structure functions $F_1$ and $F_3$, because for given $x$ and $Q^2$ Equation 2.9 implies that $y$
can be varied only by changing the center-of-mass energy of the lepton-hadron collision.
Hence, HERA data only provide three independent combinations of structure functions and
thus of parton distributions (NC and CC with positively or negatively charged leptons).
However, a fourth combination may be obtained because the scale dependence of the $\gamma^*$ and
$Z$ contributions to NC scattering is different (see Equation 2.30). It follows that the very
precise HERA data can determine four independent linear combinations of PDFs, which
can be chosen as the two lightest flavors and antiflavors.

Currently, neutrino deep-inelastic scattering data are available on heavy, approximately 
isoscalar, nuclear targets. The energy of the neutrino beam usually has a spectrum, so the 
value of $y$ given by Equation 2.12 is not fixed, and the contributions of $F_1$ and $F_3$ to the
cross section can be disentangled. On an isoscalar target at leading order

$$
\begin{aligned}
F_2^{\nu} &= x\left(u + \bar u + d + \bar d + 2s + 2\bar c\right) + O(\alpha_s), \\
F_2^{\bar\nu} &= x\left(u + \bar u + d + \bar d + 2\bar s + 2c\right) + O(\alpha_s), \\
F_3^{\nu} &= u - \bar u + d - \bar d + 2s - 2\bar c + O(\alpha_s), \\
F_3^{\bar\nu} &= u - \bar u + d - \bar d - 2\bar s + 2c + O(\alpha_s),
\end{aligned} \tag{2.36}
$$

so neutrino data provide an accurate handle on the total valence component

$$
V(x,Q^2) = \sum_{i=1}^{n_f} \left(q_i(x,Q^2) - \bar q_i(x,Q^2)\right). \tag{2.37}
$$
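
The connection between the neutrino structure functions and the total valence can be made explicit by combining the two $F_3$ structure functions. The sketch below assumes the leading-order isoscalar-target expressions $F_3^{\nu} = u - \bar u + d - \bar d + 2s - 2\bar c$ and $F_3^{\bar\nu} = u - \bar u + d - \bar d - 2\bar s + 2c$, and four active flavors:

```latex
\begin{aligned}
F_3^{\nu} + F_3^{\bar\nu} &= 2\left[(u-\bar u) + (d-\bar d) + (s-\bar s) + (c-\bar c)\right] + O(\alpha_s)
  = 2V + O(\alpha_s),\\
F_3^{\nu} - F_3^{\bar\nu} &= 2\left(s+\bar s\right) - 2\left(c+\bar c\right) + O(\alpha_s).
\end{aligned}
```

The sum thus measures the total valence directly, while the difference is sensitive to the strange and charm content.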

A more direct determination of the light flavor decomposition can be obtained using the
Drell-Yan process, and in particular by selecting different PDF combinations by looking at
different final states. Specifically [107], for neutral-current Drell-Yan production on proton
and neutron (or deuteron) targets, using the isospin symmetry in Equation 2.33 one gets
at leading order

$$
\frac{\sigma^{pn}}{\sigma^{pp}} \sim \frac{\frac{4}{9} u^p \bar d^p + \frac{1}{9} d^p \bar u^p}{\frac{4}{9} u^p \bar u^p + \frac{1}{9} d^p \bar d^p} + O(\alpha_s) + \text{heavier quarks}, \tag{2.38}
$$

where "heavier quarks" denotes strange and heavier flavors, which give a smaller
contribution at least in the region of $x > 0.1$. In particular, in the "valence" region $x > 0.1$ the
up distribution is roughly twice as large as the down distribution (assuming $\bar u \sim \bar d$), so the
first term in both the numerator and the denominator of Equation 2.38 gives the dominant
contribution, and the ratio reduces to $\sim \bar d^p/\bar u^p$. Hence this particular combination of cross
sections provides a sensitive probe of the $\bar u/\bar d$ ratio.
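
The sensitivity of this ratio to the light antiquarks can be quantified with a short estimate. The sketch below assumes the leading-order form of the ratio with charge weights $4/9$ and $1/9$, neglects heavier quarks, and sets $u^p = 2 d^p$ exactly; $r \equiv \bar d^p/\bar u^p$ is a shorthand introduced here:

```latex
\begin{aligned}
\frac{\sigma^{pn}}{\sigma^{pp}}
  &\simeq \frac{\tfrac{4}{9}\,(2d^p)\,\bar d^p + \tfrac{1}{9}\,d^p\,\bar u^p}
               {\tfrac{4}{9}\,(2d^p)\,\bar u^p + \tfrac{1}{9}\,d^p\,\bar d^p}
   = \frac{8\,\bar d^p + \bar u^p}{8\,\bar u^p + \bar d^p}
   = \frac{8r + 1}{8 + r},\\
\frac{8r+1}{8+r} &= 1 + \tfrac{7}{9}\,(r-1) + O\!\left((r-1)^2\right).
\end{aligned}
```

For example, $r = 1.2$ gives a ratio of $10.6/9.2 \approx 1.15$, so most of a deviation of $\bar d/\bar u$ from unity carries over to the measured cross-section ratio.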

In the charged-current case, one may exploit the fact that using charge-conjugation
symmetry to relate the $p$ and $\bar p$ PDFs,

$$
q_i^{\bar p} = \bar q_i^{p}, \tag{2.39}
$$

at leading order one gets

$$
\frac{\sigma_{W^+}}{\sigma_{W^-}} \sim \frac{u^p(x_1)\, d^p(x_2) + \bar d^p(x_1)\, \bar u^p(x_2)}{d^p(x_1)\, u^p(x_2) + \bar u^p(x_1)\, \bar d^p(x_2)} + O(\alpha_s) + \text{Cabibbo suppressed} + \text{heavy quarks}, \tag{2.40}
$$

where heavy quarks denotes charm and heavier flavors, and we have assumed that cross
sections are differential in rapidity. If the kinematics are chosen in such a way that $x_i$ are
in the "valence" region, where quark distributions are sizably larger than antiquark ones,
then the ratio of Equation 2.40 is mostly sensitive to the light quark ratio $u/d$ [108, 109].

2.3.3 Strangeness

Strangeness is nontrivial to determine, because it has the same electroweak couplings as
the down distribution, while it is typically smaller than it. The only way of determining it
accurately from DIS is to include semi-inclusive information. A simple way of doing this is
to use data for neutrino deep-inelastic charm production (known as "dimuon" production). 
At leading order the structure functions are then 

$$
F_2^{\nu,c}(x,Q^2) = xF_3^{\nu,c}(x,Q^2) = 2x\left(|V_{cd}|^2\, d(x) + |V_{cs}|^2\, s(x) + |V_{cb}|^2\, b(x)\right) + O(\alpha_s), \tag{2.41}
$$

while $F_2^{\bar\nu,c}(x,Q^2) = -xF_3^{\bar\nu,c}(x,Q^2)$ probes the corresponding antiquarks.

Drell-Yan data also constrain strangeness. Specifically, the cross-section ratio of Equa- 
tion 2.40 receives a contribution from strange and charm quarks which, up to CKM matrix 
elements, coincides with the contribution from down and up quarks respectively. Well 
above the charm threshold this contribution is sizable, so comparing high- and low-scale 
Drell-Yan data potentially leads to a rather accurate determination of strangeness [31]. 

2.3.4 Gluons 

The determination of the gluon distribution is nontrivial because the gluon does not couple 
to electroweak final states. It does, however, mix at leading order through perturbative 
evolution: so even in LO expressions for cross sections and structure functions, the gluon 
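does determine their scale dependence. Indeed

$$
\frac{d}{d\ln Q^2} F_2^{s}(x,Q^2) = \int_x^1 \frac{dy}{y} \left[ P_{qq}^{s}\!\left(\frac{x}{y}, \alpha_s(Q^2)\right) F_2^{s}(y,Q^2) + 2 n_f\, P_{qg}^{s}\!\left(\frac{x}{y}, \alpha_s(Q^2)\right) g(y,Q^2) \right] + O(\alpha_s^2), \tag{2.42}
$$

where by $F_2^{s}(x,Q^2)$ we denote the singlet component (defined as in Equation 2.15) of the
$F_2$ structure function.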

It follows that the gluon is mostly determined by scaling violations, or by its coupling to 
strongly-interacting final states, i.e., jets. The main shortcoming of the determination from 
scaling violations is that the gluon only couples strongly to other PDFs for sufficiently small 
$x$: specifically, at large $x$, $P_{qg}$ in Equation 2.14 rapidly becomes negligible in comparison
to $P_{qq}$. Hence, the large $x$ gluon is likely affected by large uncertainties, which can only be
reduced by looking at hadronic (jet) final states. 

2.3.5 Global Fits 

In current state of the art "global" fits, information on PDFs is maximized by combining 
experimental information on an array of different physical processes, which provide a handle 
on different PDFs or combinations of PDFs, in various kinematic regions. In Table 1, taken 
from Reference [27], we list the processes that are included in a typical present-day global 
fit (MSTW08), and the PDFs they constrain. The CTEQ and NNPDF global fits, to be 
discussed below in Section 3, have similar features. 

Based on this table and the previous discussion we conclude that: 

• information on the overall shape of quarks and gluons at medium x, as well as on 
the isosinglet-isotriplet separation, come from fixed-target DIS data on proton and 
deuterium targets (dominated by $\gamma^*$ exchange);



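| Process | Subprocess | Partons | $x$ range |
|---|---|---|---|
| $\ell^\pm \{p,n\} \to \ell^\pm X$ | $\gamma^* q \to q$ | $q, \bar q, g$ | $x \gtrsim 0.01$ |
| $\ell^\pm\, n/p \to \ell^\pm X$ | $\gamma^*\, d/u \to d/u$ | $d/u$ | $x \gtrsim 0.01$ |
| $pp \to \mu^+\mu^- X$ | $u\bar u, d\bar d \to \gamma^*$ | $\bar q$ | $0.015 \lesssim x \lesssim 0.35$ |
| $pn/pp \to \mu^+\mu^- X$ | $(u\bar d)/(u\bar u) \to \gamma^*$ | $\bar d/\bar u$ | $0.015 \lesssim x \lesssim 0.35$ |
| $\nu(\bar\nu) N \to \mu^-(\mu^+) X$ | $W^* q \to q'$ | $q, \bar q$ | $0.01 \lesssim x \lesssim 0.5$ |
| $\nu N \to \mu^-\mu^+ X$ | $W^* s \to c$ | $s$ | $0.01 \lesssim x \lesssim 0.2$ |
| $\bar\nu N \to \mu^+\mu^- X$ | $W^* \bar s \to \bar c$ | $\bar s$ | $0.01 \lesssim x \lesssim 0.2$ |
| $e^\pm p \to e^\pm X$ | $\gamma^* q \to q$ | $g, q, \bar q$ | $0.0001 \lesssim x \lesssim 0.1$ |
| $e^+ p \to \bar\nu X$ | $W^+ \{d,s\} \to \{u,c\}$ | $d, s$ | $x \gtrsim 0.01$ |
| $e^\pm p \to e^\pm c\bar c X$ | $\gamma^* c \to c$, $\gamma^* g \to c\bar c$ | $c, g$ | $0.0001 \lesssim x \lesssim 0.01$ |
| $e^\pm p \to \text{jet} + X$ | $\gamma^* g \to q\bar q$ | $g$ | $0.01 \lesssim x \lesssim 0.1$ |
| $p\bar p \to \text{jet} + X$ | $gg, qg, qq \to 2j$ | $g, q$ | $0.01 \lesssim x \lesssim 0.5$ |
| $p\bar p \to (W^\pm \to \ell^\pm \nu) X$ | $ud \to W$, $\bar u\bar d \to W$ | $u, d, \bar u, \bar d$ | $x \gtrsim 0.05$ |
| $p\bar p \to (Z \to \ell^+\ell^-) X$ | $uu, dd \to Z$ | $d$ | $x \gtrsim 0.05$ |

Table 1. The main processes included in the MSTW 2008 global PDF analysis ordered in three
groups: fixed-target experiments, HERA and the Tevatron. For each process we give an indication
of their dominant partonic subprocesses, the primary partons which are probed and the approximate
range of $x$ constrained by the data.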



• an accurate determination of the gluon and quark distributions at small $x$ (where the
quark sector is dominated by the singlet) and of individual light flavors at medium
$x$ (where NC and CC data play a role in separating individual flavors) comes from
the very precise HERA NC and CC data;

• information on the flavor separation at small x comes from Tevatron Drell-Yan data 
(in particular the W asymmetry, as discussed above); 

• the flavor separation at medium x is mostly controlled by the Drell-Yan data for 
fixed proton and nucleus targets; 

• the total valence component is constrained by the neutrino inclusive DIS data; 

• strangeness is controlled by neutrino dimuon data, as well as by the interplay of the 
W and Z production data with lower-scale DIS and Drell-Yan data; 

• the large x gluon, only weakly determined by DIS scaling violations, is further con- 
strained by Tevatron jet data. 



3 STATUS OF PDF SETS 

Various fitting groups currently produce general-purpose sets of PDFs of the nucleon, with
most of the groups having a long history which goes back at least a couple of decades, as
summarized in the introduction. Six of these groups have been providing regular updates
of their PDFs, and here we will discuss their most recent NNLO sets: MSTW08 [27],
CT10 [110], NNPDF2.3 [111], HERAPDF1.5 [40], ABM11 [35] and JR09 [33]. All of
these sets are publicly available through the standard LHAPDF interface [54], though CT10
NNLO and HERAPDF1.5 have not been presented in a journal publication. We will not
discuss PDFs which are not available from LHAPDF. Also, we will not discuss PDFs for
different kinds of targets or with more specialized or limited goals: PDFs for nuclei or other
hadrons (such as pions); PDFs partly or entirely determined based on models of hadrons
rather than (or in addition to) data; PDFs for medium-energy physics which incorporate
non-perturbative or higher-twist effects.

| Data | MSTW08 | CT10 | NNPDF2.3 | HERAPDF1.5 | ABM11 | JR09 |
|---|---|---|---|---|---|---|
| HERA DIS | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Fixed-target DIS | ✓ | ✓ | ✓ | ✗ | ✓ | ✓ |
| Fixed-target DY | ✓ | ✓ | ✓ | ✗ | ✓ | ✓ |
| Tevatron W+Z+jets | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
| LHC W+Z+jets | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ |

Table 2. Data included in various NNLO PDF sets.

Table 3. Main features of various NNLO PDF sets (see text for details).

The main feature which distinguishes PDF sets is the data on which they are based.
These are summarized in Table 2. Only three groups (MSTW08 [27], CT10 [110], and
NNPDF2.3 [111]) make a fully global fit, defined here to be a fit including HERA and
fixed-target DIS data, fixed-target Drell-Yan production, and Tevatron data on $W$, $Z$ and
jet production. The NLO version of the JR09 fit, GJR08 [32], does include some Tevatron
jet data. The NNPDF2.3 set is the only one to include LHC data; in order to assess the
impact of the additional LHC data, we will also compare to a variant of the NNPDF2.3
analysis without LHC data. Concerning HERA data, note that CT10 and NNPDF2.3
include the combined HERA I inclusive data [38], MSTW08 and JR09 instead include the
older separate data from H1 and ZEUS, ABM11 includes combined HERA I data but only
with the cut $Q^2 < 1000$ GeV$^2$, and HERAPDF1.5 additionally includes the preliminary
combined HERA II inclusive data [39]. The kinematical coverage of the NNPDF2.3 data
set is shown in Figure 1, with the $x$ and $Q^2$ values shown determined using leading-order
parton kinematics.

As discussed in Section 2, various alternative choices are possible in PDF determination 
both in terms of theory and methodology. The main choices which underlie the PDF sets 
we consider here are summarized in Table 3. All sets are now available at NLO and NNLO, 
and all but HERAPDF also have a LO version, though, as mentioned in Section 2.1.3, LO 
PDFs are often optimized for use with Monte Carlo event generators, and thus we will not 
discuss them further. The number of independently parametrized PDFs varies between 
seven (the three lighter quarks and antiquarks and the gluon), six (the total strangeness 
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Figure 1. The kinematical coverage of the expernTiental data used in tlie NNPDF2.3 PDF deter- 
mination, from Reference [111]. 

is independently parametrized, but not the strange and antistrange separately) or five
(strangeness not fitted, and assumed to be a fixed fraction of the sea). The methodology
for uncertainty representation and determination is Hessian, based on a parametrization
of the form of Equation 2.22 for all groups except NNPDF, which uses a Monte Carlo
representation based on a neural network parametrization (see Section 2.2). In order to
determine confidence levels, CT and MSTW use dynamical tolerance, denoted in the table
as "DT" (recall Section 2.2), JR use simple tolerance, denoted by "T", HERAPDF uses
$\Delta\chi^2 = 1$ but supplemented by an estimate of model and parametrization uncertainties,
and ABM just use $\Delta\chi^2 = 1$. In each case the total number of free parameters at NNLO
is as given in the table; the number of parameters at NLO is the same for all groups but
CT10, which at NLO has 26 parameters, and HERAPDF1.5, which at NLO has only 10
parameters. MSTW08 uses 28 free parameters for the determination of the best fit, 8 of
which are fixed when determining uncertainties. JR09 introduces the further "dynamical"
assumption that PDFs are valence-like at a low scale $Q_0^2 < 1$ GeV$^2$. All groups but ABM
and JR use variable-flavor number schemes, with heavy-quark masses included using one
of the matching methods discussed in Section 2.1. The treatment of $\alpha_s$ will be discussed
in more detail below.

Figure 2. $\alpha_s(M_Z^2)$ values for which NNLO PDFs are provided by various groups. The larger
symbols denote the values used in subsequent plots.
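
To make the Hessian representation mentioned above concrete, the following is a minimal sketch of how a symmetric Hessian uncertainty on an observable is commonly computed from paired eigenvector predictions. The symmetric formula used here is the standard one and is an assumption of this sketch (it is not quoted in this section); the function name and all numbers are hypothetical.

```python
import math


def hessian_uncertainty(plus, minus):
    """Symmetric Hessian uncertainty from paired eigenvector predictions.

    plus[k] and minus[k] are the observable evaluated with the k-th
    eigenvector PDF set displaced in the positive and negative direction.
    Assumed formula: 0.5 * sqrt(sum_k (F_k^+ - F_k^-)^2).
    """
    if len(plus) != len(minus):
        raise ValueError("need one 'minus' prediction for every 'plus' prediction")
    return 0.5 * math.sqrt(sum((p - m) ** 2 for p, m in zip(plus, minus)))


# Hypothetical predictions (arbitrary units) for three eigenvector pairs.
plus = [10.3, 10.0, 9.9]
minus = [9.7, 10.4, 10.1]
delta = hessian_uncertainty(plus, minus)
print(f"Hessian uncertainty: {delta:.3f}")
```

The confidence level attached to such an uncertainty depends on the tolerance with which the eigenvector sets were constructed, which is why the choices listed above differ between groups.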



3.1 Values and Uncertainties of Strong Coupling 

An important issue which distinguishes PDF sets is the treatment of the strong coupling
$\alpha_s(M_Z^2)$. Because the value of $\alpha_s$ is strongly correlated with PDFs, cross-section
calculations should always use PDFs which have been determined with the same value of
$\alpha_s$ that is adopted for the calculation itself.

The value of $\alpha_s(M_Z^2)$, and its uncertainty, can either be determined simultaneously
with the PDFs or imposed as an external constraint. Furthermore, if the value of $\alpha_s$ is
determined simultaneously with the PDFs, the quoted value of the PDF uncertainties may
refer strictly to the PDF-only uncertainty, with $\alpha_s$ kept fixed at its best-fit value, or it
may also include the uncertainty due to the variation of $\alpha_s$ itself.

The values of $\alpha_s(M_Z^2)$ used by different NNLO PDF fitting groups are shown in
Figure 2, where the larger symbols represent the default value used by each group, which
is used for the determination of PDF uncertainties. For MSTW08, ABM11 and JR09, this
value is determined from the fit, with uncertainties shown by the horizontal error bars,
while for CT it is chosen as a fixed value close to the PDG world average [112], also shown
in the plot. NNPDF do not have a default value and provide a full Monte Carlo replica set
for each of the $\alpha_s$ values shown, though they have also presented an $\alpha_s$ determination [113]
based on their previous NNPDF2.1 set, with results consistent with the PDG average. For
NNPDF, which does not have a default value, we arbitrarily choose $\alpha_s(M_Z^2) = 0.119$ as

Figure 3. MSTW 2008 NNLO PDFs (68% C.L.) at $Q^2 = 10$ GeV$^2$ and $Q^2 = 10^4$ GeV$^2$, from
Reference [27].

default to be used in all plots, so that the MSTW08 and NNPDF $\alpha_s$ values bracket the
CT10 value by providing a variation of $\Delta\alpha_s = 0.001$ about it, for reasons to be discussed
in Section 4.1.3 below.

The smaller symbols in Figure 2 indicate the PDF sets with alternative values of
$\alpha_s(M_Z^2)$ provided by each group. All groups provide only the best-fit PDF set for each
of these values, except NNPDF which instead provide a full set for each value. The PDF
uncertainties provided by MSTW and CT at the reference value of $\alpha_s$, and by NNPDF for
all values of $\alpha_s$, do not include the $\alpha_s$ uncertainty, though MSTW also provide additional
sets allowing combined PDF+$\alpha_s$ uncertainties [114]. JR and ABM only provide combined
PDF+$\alpha_s$ uncertainties.

3.2 Comparison of PDFs 

A typical set of PDFs (MSTW08) at two different scales is shown in Figure 3. It is clear from 
Equation 2.1, however, that, in order to understand properties of hadronic cross sections 
and the impact of PDFs on them, it is more useful to consider the relevant parton-parton 
luminosities, defined as in Equation 2.4, rather than the PDFs themselves. 
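
As a rough illustration of what a parton-parton luminosity is, the sketch below evaluates one numerically. It assumes the commonly used form $\mathcal{L}_{ab}(\hat s) = \frac{1}{s}\int_\tau^1 \frac{dx}{x}\, f_a(x)\, f_b(\tau/x)$ with $\tau = \hat s/s$, which may differ in normalization from Equation 2.4; the scale dependence of the PDFs is ignored, and `toy_gluon` is a hypothetical shape, not a fitted PDF.

```python
import math


def parton_luminosity(f_a, f_b, s_hat, s, n_steps=2000):
    """Midpoint-rule estimate of (1/s) * int_tau^1 dx/x f_a(x) f_b(tau/x).

    The integration variable is ln(x), so that dx/x = d(ln x).
    The form of the integral is an assumption of this sketch.
    """
    tau = s_hat / s
    log_tau = math.log(tau)
    step = -log_tau / n_steps  # bin width in ln(x), from ln(tau) up to 0
    total = 0.0
    for i in range(n_steps):
        x = math.exp(log_tau + (i + 0.5) * step)  # midpoint of the bin
        total += f_a(x) * f_b(tau / x) * step
    return total / s


def toy_gluon(x):
    """Hypothetical gluon-like shape, for illustration only."""
    return x ** -1.2 * (1.0 - x) ** 5


s = 8000.0 ** 2  # LHC at sqrt(s) = 8 TeV, in GeV^2
lumi_higgs = parton_luminosity(toy_gluon, toy_gluon, 126.0 ** 2, s)
lumi_top = parton_luminosity(toy_gluon, toy_gluon, (2 * 173.18) ** 2, s)
# Sanity check with flat "PDFs": the integral is then ln(1/tau) exactly.
flat_check = parton_luminosity(lambda x: 1.0, lambda x: 1.0, 126.0 ** 2, s)
print(lumi_higgs / lumi_top)
```

With real PDF sets the same integral would be evaluated at the factorization scale of the process, and ratios of such luminosities between sets are what the comparison plots display.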

A detailed study would involve comparison of the luminosity for all parton combinations:
for example, $\mathcal{L}_{u\bar d}$, relevant for $W^+$ production, and so forth. Here we only compare
the $gg$ luminosity and the $q\bar q$ luminosity summed over quark flavors, i.e.,

$$\sum_{q=u,d,s,c,b} \left( \mathcal{L}_{q\bar q} + \mathcal{L}_{\bar q q} \right), \qquad (3.1)$$

where $\mathcal{L}_{ab}$ is defined in Equation 2.4. In Figure 4 we show the NNLO $q\bar q$ luminosities,
displayed as a ratio to the MSTW 2008 NNLO luminosities, for the LHC at $\sqrt{s} = 8$ TeV,
and in Figure 5 we show the corresponding NNLO $gg$ luminosities. We use the $\alpha_s$ values
for each set shown in Figure 2. Note that all uncertainty bands are shown at 68% C.L.,
requiring the CT10 uncertainties (corresponding to a nominal 90% C.L.) to be divided by
a factor of 1.64485. Similar plots, using a common value of $\alpha_s(M_Z^2) = 0.118$, can be
found in Reference [45].
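
The rescaling factor of 1.64485 quoted above can be checked directly. The sketch below assumes Gaussian uncertainties, so that a two-sided 90% C.L. interval corresponds to 1.64485 standard deviations; the helper names are hypothetical.

```python
from statistics import NormalDist


def two_sided_z(confidence_level):
    """Number of Gaussian standard deviations enclosing a central fraction."""
    return NormalDist().inv_cdf(0.5 * (1.0 + confidence_level))


z90 = two_sided_z(0.90)    # close to 1.64485
z68 = two_sided_z(0.6827)  # close to 1, i.e. one standard deviation


def rescale_90_to_68(uncertainty_90):
    """Convert a 90% C.L. Gaussian uncertainty to a one-sigma (68% C.L.) one."""
    return uncertainty_90 / z90


print(f"z(90%) = {z90:.5f}")
```

For example, a hypothetical 90% C.L. uncertainty of 5% on a luminosity would become `rescale_90_to_68(5.0)`, roughly 3%, at 68% C.L.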

The relevant values of $\sqrt{\hat s} = M_{W,Z}$ are indicated for the $q\bar q$ luminosities, and the
relevant values of $\sqrt{\hat s} = M_H, 2m_t$ (for $M_H = 126$ GeV and $m_t = 173.18$ GeV) are indicated
for the $gg$ luminosities. There is fairly good agreement for the three global fits (MSTW08,
CT10 and NNPDF2.3), but more variation for the other sets, which confirms that the
dominant factor in determining the features of the PDFs is the choice of data set. There is
little difference between the luminosities computed using NNPDF2.3 and NNPDF2.3noLHC,
which shows that the impact of the LHC data is moderate. We will see in Section 4 that
this is often but not always the case. The NLO trend between groups is similar to the
NNLO trend, with the exception of HERAPDF at large $\sqrt{\hat s}$ values, where the HERAPDF1.5
NLO set [39] has a much larger $q\bar q$ luminosity, and a much softer $gg$ luminosity, than other
NLO PDF groups.

These luminosities are the basic input to LHC phenomenology, as we will discuss
shortly. Current recommendations [46] to use global fits for LHC searches and calibration,
already mentioned in Section 1 and to be discussed in Section 4 below, were based on
similar, more detailed comparisons of luminosities and PDFs made in 2010 [42]. However,
the situation has improved considerably since 2010, when only MSTW08 had a NNLO PDF
set from a global fit, and differences at NLO between MSTW08, CTEQ6.6 and NNPDF2.0
were larger, due to, for example, the use of a less flexible gluon parametrization in CTEQ6.6
and the lack of inclusion of terms suppressed by powers of the charm-quark mass in NNPDF2.0.

A recent thorough analysis of PDFs and luminosities [45] shows that the general fea- 
tures of NNLO global PDF sets, at a scale of order of ~ following, bearing 
in mind that experimental information is not available outside the region 10""^ ^ 2; < 0.4 
(see Figure 1). Up and down quark and antiquark distributions are known to an accuracy 
better than about 5% in a wide range of x, roughly 10~^ ^ 2; < 0.3 for the up distri- 
bution, 10~^ < x < 0.1 for the down and the antiup distribution, lO"'^ ^ x < 0.01 for 
the antidown distribution, and there is good agreement between the three global sets. For 
smaller values of x, uncertainties gradually blow up, but there remains good agreement 
between sets as the behavior in this region is mostly driven by perturbative evolution, 
while, for larger values of x, uncertainties blow up and widely different behaviors are ob- 
served between sets: already for x ~ 0.5 uncertainties are likely to be bigger than 10%, and 
perhaps underestimated especially as x grows. Strangeness is nominally known to about 
10-15% accuracy in the region $0.003 \lesssim x \lesssim 0.1$. However, it should be kept in mind that

Figure 4. NNLO $q\bar q$ luminosity functions, $\Sigma_q(q\bar q)$, at the LHC ($\sqrt{s} = 8$ TeV) as a function
of $\sqrt{\hat s}$ (GeV), taken as the ratio to MSTW08. (a) MSTW08 vs. CT10 vs. NNPDF2.3noLHC
vs. NNPDF2.3, then (b) MSTW08 vs. ABM11 vs. HERAPDF1.5 vs. JR09.

Figure 5. NNLO $gg$ luminosity functions at the LHC ($\sqrt{s} = 8$ TeV) as a function of
$\sqrt{\hat s}$ (GeV), taken as the ratio to MSTW08. (a) MSTW08 vs. CT10 vs. NNPDF2.3noLHC
vs. NNPDF2.3, then (b) MSTW08 vs. ABM11 vs. HERAPDF1.5 vs. JR09.

strangeness is largely determined by neutrino dimuon data (see Section 2.3.3), which are
subject to various poorly controlled systematics; moreover, one of the three global sets does
not parametrize the $s$ and $\bar s$ distributions independently, while another only has a small
number of parameters. Indeed, disagreements between different sets are up to the 30%
level. The gluon distribution is known with an accuracy which is comparable to or marginally
worse than that of light quarks, i.e., ~ 5% at small 10~^ ^ x < 0.1, but rapidly deterio- 
rates at larger x, where it is only constrained by jet data. As already mentioned, here the 
agreement between global sets is not as good as one might hope, and discrepancies up to 
the level of 1.5-2 sigma between global fits are observed in the region x ~ 0.02, which is 
relevant for Higgs production. 

Comparison of NLO and NNLO PDFs suggests that uncertainties related to higher- 
order corrections are smaller than 5% in the region where PDFs are currently determined, 
meaning that the neglected theory uncertainties are likely to be smaller than the experi- 
mental PDF uncertainties at NNLO, whereas at NLO they might be comparable. 

4 LHC PHENOMENOLOGY 

The first LHC proton run was completed in December 2012, after a remarkable three years, 
with a center-of-mass energy of 7 TeV in 2010/2011 and 8 TeV in 2012. Data have been 
collected for a vast array of Standard Model processes. Many of these are already leading 
to new significant constraints on PDFs, with others holding the promise to do so in the 
very near future, and the knowledge of PDFs has played a significant role in the discovery 
of a Higgs-like particle [1, 2]. In this section, we confront LHC data with the predictions of 
various PDF sets for some key Standard Model total cross sections, specifically W, Z, Higgs 
boson and top-pair production, then we discuss methods for combining the predictions 
made using the PDF sets from different groups. We finally assess the current constraints 
on PDFs provided by LHC data and we examine the prospects for future improvements. 

4.1 Predictions for LHC Cross Sections 

A comprehensive study of the PDF dependence of key LHC cross sections has recently been 
made in Reference [45], following earlier work in References [42-44]. Here we will present 
some selected results from a continuation of the earlier study [42-44], but now updated to 
account for the latest PDF sets and LHC data. The impact of LHC data will be discussed 
in more detail in Section 4.2, but we will already show here results with the two variants of 
NNPDF2.3, with and without LHC data, which will give us an indication of their current 
impact. 

4.1.1 W and Z Production 

The $W^+$, $W^-$ and $Z^0$ cross sections at the LHC differ from those at the Tevatron discussed
in Section 2.3 because the LHC is a $pp$ rather than a $p\bar p$ collider. Hence, somewhat different
combinations of the light quark and antiquark distributions are measured. To understand
this, it is useful to consider the $W^\pm \equiv W^+ + W^-$ to $Z^0$ cross-section ratio, and the $W^+$ to
$W^-$ cross-section ratio. Assuming that very roughly $\bar u(x_2) \sim \bar d(x_2)$, and neglecting heavier
quarks, one gets

$$\frac{\sigma_{W^+} + \sigma_{W^-}}{\sigma_{Z^0}} \sim \frac{u(x_1) + d(x_1)}{0.29\,u(x_1) + 0.37\,d(x_1)}, \qquad (4.1)$$

$$\frac{\sigma_{W^+}}{\sigma_{W^-}} \sim \frac{u(x_1)\,\bar d(x_2)}{d(x_1)\,\bar u(x_2)} \sim \frac{u(x_1)}{d(x_1)}, \qquad (4.2)$$

where $x_1$ and $x_2$ are fixed when measuring a rapidity distribution, and we have assumed
that $x_{1,2}$ are in a region where $q(x_1)\bar q(x_2)$ dominates over $\bar q(x_1) q(x_2)$. Equation 4.1 shows
that the $W^\pm$ and $Z$ cross sections are very highly correlated (so that their ratio depends
very little on the PDFs), while Equation 4.2 (to be compared to its counterpart at a $p\bar p$
collider, Equation 2.40) shows that the ratio of $W^+/W^-$ cross sections is a sensitive probe
of the $u/d$ ratio.
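
The weights 0.29 and 0.37 multiplying the up and down distributions in the $W^\pm$ to $Z^0$ ratio are consistent with the squared $Z$ couplings $v_q^2 + a_q^2$ of up-type and down-type quarks. The sketch below reproduces this arithmetic; the convention $v_q = T_3 - 2Q\sin^2\theta_W$, $a_q = T_3$ and the input $\sin^2\theta_W = 0.23$ are assumptions made here for illustration and are not stated in the text.

```python
def z_coupling_squared(t3, charge, sin2_theta_w=0.23):
    """Sum of squared vector and axial Z couplings of a quark.

    Assumed convention: v = T3 - 2*Q*sin^2(theta_W), a = T3.
    """
    vector = t3 - 2.0 * charge * sin2_theta_w
    axial = t3
    return vector ** 2 + axial ** 2


up_weight = z_coupling_squared(t3=0.5, charge=2.0 / 3.0)
down_weight = z_coupling_squared(t3=-0.5, charge=-1.0 / 3.0)
print(round(up_weight, 2), round(down_weight, 2))
```

Because the two weights are close to each other, the denominator tracks the numerator $u(x_1) + d(x_1)$ closely, which is the reason the ratio depends only weakly on the PDFs.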

We now consider explicitly predictions for W and Z production, where, for definiteness,
we show results only for total cross sections. However, as mentioned, only rapidity
distributions probe fixed leading-order parton kinematics. Furthermore, only around half
of the total $W^\pm$ and $Z^0$ cross sections lie inside the acceptance of the ATLAS and CMS
detectors, so that a theory calculation is needed to extrapolate the measurement over
the whole phase space, introducing an additional uncertainty on the total cross sections.
Therefore, data-to-theory comparisons for precision physics are best made at the level of
the fiducial cross section (i.e., within the acceptance), which is possible at NNLO using
the public FEWZ [115, 116] and DYNNLO [117] codes, and indeed was done in the ATLAS
publication [118].

In Figure 6 we show $W^\pm \equiv W^+ + W^-$ versus $Z^0$ and $W^+$ versus $W^-$ total cross
sections. We also compare to the experimental measurements using the 2010 LHC data
from ATLAS [118] and CMS [119]. The measured $Z^0$ cross sections have been corrected [43]
for the small $\gamma^*$ contribution and the finite invariant-mass range of the lepton pair (different
for ATLAS and CMS) using a theory calculation at NNLO [120]. The ellipses are drawn
to account for the correlations between the two cross sections, both for the experimental
measurements and for the theoretical predictions, in such a way that the projection onto
either axis gives the one-sigma uncertainty for the individual cross sections, so that the area
of the two-dimensional ellipse corresponds to a confidence level somewhat smaller than the
conventional 68% [84].
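
The reduced coverage of such ellipses can be quantified under the assumption of a two-dimensional Gaussian distribution: the contour whose projections onto the axes equal the one-sigma uncertainties is the $\Delta\chi^2 = 1$ contour, which encloses about 39% of the probability. The following sketch shows the arithmetic; the function names are hypothetical.

```python
import math


def ellipse_coverage(delta_chi2):
    """Probability inside a Delta chi^2 contour of a 2D Gaussian."""
    return 1.0 - math.exp(-0.5 * delta_chi2)


def delta_chi2_for_coverage(probability):
    """Delta chi^2 needed for a 2D Gaussian contour of given coverage."""
    return -2.0 * math.log(1.0 - probability)


coverage_one_sigma = ellipse_coverage(1.0)       # about 0.39
delta_chi2_68 = delta_chi2_for_coverage(0.6827)  # about 2.30
print(f"{coverage_one_sigma:.3f} {delta_chi2_68:.2f}")
```

An ellipse enclosing 68% of the probability would therefore have to be drawn at $\Delta\chi^2 \approx 2.3$, and its projections would exceed the one-sigma uncertainties of the individual cross sections.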

The spread in predictions using the different PDF sets is comparable to the (dominant)
luminosity uncertainty of 4% (CMS) or 3.4% (ATLAS), with the JR09 prediction
being a clear outlier. The correlation of $W^\pm$ and $Z$ cross sections is clearly visible from
the plot. The impact of LHC data can be gauged by comparing the NNPDF2.3noLHC and
NNPDF2.3 predictions: while for the $W^\pm$ and $Z$ cross sections there is essentially no
difference, consistent with the stability of the $q\bar q$ luminosity of Figure 4, there is a clear
reduction of uncertainty in the $W^+/W^-$ cross-section ratio, seen as a shrinking of the
corresponding ellipse, which comes from an improved knowledge of the light flavor separation.

Figure 6. NNLO W and Z cross sections at the LHC ($\sqrt{s} = 7$ TeV): (a) $W^\pm$ vs. $Z^0$ and
(b) $W^+$ vs. $W^-$ total cross sections at NNLO, compared to data from CMS [119] and
ATLAS [118]. Lines (dotted) of constant ratio are also drawn for reference.

4.1.2 Higgs and Top-pair Production

Whereas the cross sections for production of W and Z bosons are sensitive to the quark
distributions, we now turn to processes that are sensitive to the gluon distribution. The
dominant production mechanism for both Standard Model Higgs bosons and top pairs at
the LHC is gluon-gluon fusion. The $gg \to H$ process proceeds mainly through a
top-quark loop, so both processes start at $\mathcal{O}(\alpha_s^2)$ at LO and are directly sensitive to the
value of $\alpha_s(M_Z^2)$. Indeed, the CMS measurement of the $t\bar t$ cross section has even been used
to extract $\alpha_s$ [121]. Moreover, the gluon PDF itself, being necessarily determined through
strong-interaction processes, is the most sensitive to the value of $\alpha_s$. Therefore, for these
observables we will present predictions for cross sections as a function of $\alpha_s(M_Z^2)$.

The $gg \to H$ and $t\bar t$ cross sections are shown in Figure 7 for a Higgs mass $M_H =
126$ GeV and a top-quark pole mass of $m_t = 173.18$ GeV [122], for the LHC at 8 TeV and
7 TeV, respectively, probing the gluon distribution at different $x \sim M_H/\sqrt{s} \simeq 0.02$ and
$x \gtrsim 2m_t/\sqrt{s} \simeq 0.05$. For Higgs production we use the ggh@nnlo (version 1.4.1) code [123]
with a scale choice of $\mu_R = \mu_F = M_H/2$, while for top-pair production we use the Top++
(version 1.4) code [124] with a scale choice of $\mu_R = \mu_F = m_t$. Higgs production at NNLO,
in the limit of a heavy top-quark mass, was originally calculated in References [123, 125,
126]. The top-pair calculations [124] include exact NNLO corrections for all quark-initiated
processes (and $qg \to t\bar t$), with approximate NNLO for $gg \to t\bar t$, together with soft-gluon
resummation to next-to-next-to-leading logarithmic accuracy. The markers in Figure 7 are
centered on the $\alpha_s(M_Z^2)$ values of Figure 2 and the corresponding predicted cross section
of each PDF fitting group. The horizontal error bars span the $\alpha_s(M_Z^2)$ uncertainty, while
the vertical error bars span the PDF uncertainty, which (recall Section 3.1) for ABM11
and JR09 necessarily correspond to a combined PDF+$\alpha_s$ uncertainty. For MSTW08, the
inner vertical error bars span the PDF-only uncertainty and the outer vertical error bars
span the PDF+$\alpha_s$ uncertainty.
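
The typical momentum fractions quoted above follow from leading-order kinematics, $x_1 x_2 s = M^2$, evaluated at central rapidity where $x_1 = x_2 = M/\sqrt{s}$. A minimal sketch of this arithmetic, with a hypothetical helper name, is:

```python
def central_x(mass_gev, sqrt_s_gev):
    """Momentum fraction at central rapidity: x1 = x2 = M / sqrt(s)."""
    return mass_gev / sqrt_s_gev


x_higgs = central_x(126.0, 8000.0)       # gg -> H at 8 TeV
x_top = central_x(2 * 173.18, 7000.0)    # t tbar threshold at 7 TeV
print(f"x(Higgs) = {x_higgs:.4f}, x(top threshold) = {x_top:.4f}")
```

Both values round to the 0.02 and 0.05 quoted in the text; for top-pair production this is a lower bound, since the pair is produced at or above threshold.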

The $\alpha_s$ dependence of results is shown by the dashed lines, which interpolate the
cross-section predictions calculated using the sets with different $\alpha_s(M_Z^2)$ values provided
by each group. The vertical shaded band indicates the PDG world average value [112]
of $\alpha_s(M_Z^2)$, while the horizontal shaded band in Figure 7(b) indicates the preliminary
combination of ATLAS and CMS $t\bar t$ cross-section measurements [127, 128]. The scale
dependence, obtained by varying $\mu_R$ and $\mu_F$ by factors of two, subject to the constraint
$1/2 \le \mu_R/\mu_F \le 2$, leads to a theoretical uncertainty of 149%, while the top-mass
dependence obtained by varying $m_t = 173.18 \pm 0.94$ GeV [122] leads to an uncertainty on the
predicted cross section of ±2.8%, where these percentage uncertainties were obtained using
the central MSTW08 PDF set.

The strong dependence on the value of $\alpha_s$ is clearly seen. For the Higgs cross section it
is interesting to observe (comparing also the $gg$ luminosity plot in Figure 5) that the value
of $M_H \sim 126$ GeV with $\sqrt{s} = 8$ TeV is especially unlucky, in that predictions obtained
using global fits maximally disagree. Note that the NNPDF2.3 and NNPDF2.3noLHC
predictions in these plots are essentially identical: this shows that, unlike the W data
discussed above, the LHC jet data included in the NNPDF2.3 fit have a very moderate impact.

Figure 7. (a) NNLO $gg \to H$ total cross sections for $M_H = 126$ GeV, and (b) NNLOapprox.+NNLL
$t\bar t$ total cross sections for $m_t = 173.18$ GeV, both plotted as a function of $\alpha_s(M_Z^2)$.

We see from Figure 7(b) that, even after accounting for all uncertainties, the ABM11
PDF set is strongly disfavored by both the LHC top-pair cross section and the world average
value of $\alpha_s(M_Z^2)$. The ABM11 prediction of $\sigma_{t\bar t} = 123.3 \pm 5.5\,(\mathrm{PDF}+\alpha_s)$ pb, which
carries additional scale and $m_t$ uncertainties, is almost 30% below the measured cross
section of $\sigma_{t\bar t} = 173.3 \pm 10.1$ pb, whereas adding all experimental and theoretical
uncertainties in quadrature still gives a total uncertainty of less than 8%. We note also
from Figure 7 that the HERAPDF1.5 NNLO predictions have a very large model uncertainty
in the upwards direction, due to varying the minimum $Q^2$ cut from the default value of
$Q^2_{\min} = 3.5$ GeV$^2$ to a slightly higher value of $Q^2_{\min} = 5$ GeV$^2$.
This sensitivity is not observed in global fits, where the Tevatron jet data stabilize the fit
and so lessen sensitivity to the fine details of the treatment of the DIS data [44, 129].

4.1.3 Combination of Results from Different PDF Groups 

Whereas it is often advisable to check experimental results against predictions obtained 
using the widest available set of PDFs, there are situations where a unique reliable predic- 
tion is needed. Typical examples are searches for new physics, or acceptance calculations, 
where one does not want to inflate uncertainties unnaturally, but also does not want to 
mistake an underestimated systematic effect for a new physics effect, as has sometimes 
happened in the past. 

This suggests that use of PDF sets based on significantly smaller data sets is not advis- 
able, as these necessarily have larger uncertainties (and indeed sometimes have appeared 
as outliers, as seen above). As mentioned in Section 1, the PDF4LHC group recommended 
therefore [46] the use of the CTEQ, MSTW and NNPDF PDFs based on global fits for these 
purposes at the LHC. This recommendation was adopted by the Higgs working group [47], 
and used for Higgs searches and discovery [1, 2]. 

The problem then arises of how best to combine results from different groups.
The PDF4LHC group originally recommended taking an envelope of various predictions,
i.e., taking as a 68% C.L. the range between the top of the highest one-sigma band and
the bottom of the lowest, with the midpoint as a central value. This choice is simple to
implement, but it has no clear-cut statistical meaning. A better option is to take a
statistical combination, as originally proposed in Reference [48] and discussed in Section 2.2.3.
Results obtained either way are shown in Figure 8, where in order to construct the
statistical combination 100 Monte Carlo replicas were generated for MSTW08 and CT10 from
the original Hessian sets using the method described in Reference [98]; see Equation 2.28.
It is clear that the two methods actually produce fairly similar results, though the
envelope method is somewhat more conservative, especially when there is some disagreement
between predictions.
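
The two combination procedures can be sketched as follows. This is an illustration under simplifying assumptions, not the procedure of Reference [98]: each group's prediction is treated as a Gaussian with the quoted one-sigma uncertainty, from which 100 replicas are drawn, whereas the actual analysis generates replicas from the Hessian eigenvector sets. All numbers and function names are hypothetical.

```python
import random
import statistics


def envelope(predictions):
    """Envelope of one-sigma bands: returns (midpoint, half-width).

    predictions is a list of (central value, one-sigma uncertainty).
    """
    upper = max(c + e for c, e in predictions)
    lower = min(c - e for c, e in predictions)
    return 0.5 * (upper + lower), 0.5 * (upper - lower)


def statistical_combination(predictions, n_replicas=100, seed=0):
    """Mean and standard deviation over pooled replicas from all groups.

    Replicas are drawn from a Gaussian per group (a simplification).
    """
    rng = random.Random(seed)
    pooled = [rng.gauss(c, e) for c, e in predictions for _ in range(n_replicas)]
    return statistics.fmean(pooled), statistics.pstdev(pooled)


# Hypothetical cross sections in pb from three PDF groups.
toy = [(19.0, 0.3), (19.5, 0.4), (20.3, 0.3)]
midpoint, half_width = envelope(toy)
mean, sd = statistical_combination(toy)
print(f"envelope: {midpoint:.2f} +- {half_width:.2f}")
print(f"statistical: {mean:.2f} +- {sd:.2f}")
```

In this toy case the pooled standard deviation is smaller than the envelope half-width, in line with the observation that the envelope is the more conservative choice when the predictions disagree.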

A separate issue is how to treat the $\alpha_s$ uncertainty. The original PDF4LHC prescription
recommended a very conservative approach in which one takes the envelope of three
PDF+$\alpha_s$ uncertainties, each centered at a different central value of $\alpha_s$: this was motivated
by the feeling that the PDG $\alpha_s(M_Z^2)$ uncertainty of $\Delta\alpha_s = 0.0007$ might be somewhat
underestimated. The uncertainties shown in Figure 8 are PDF-only uncertainties for a
fixed $\alpha_s$ value. However, results for the three PDF sets are obtained using the three
different central values shown in Figure 2, so the combination implicitly includes an additional



Figure 8. NNLO (a) $Z^0$, (b) $W^+/W^-$, (c) $t\bar t$ and (d) $gg \to H$ cross sections from MSTW08,
CT10 and NNPDF2.3, combined either by taking the envelope of the three predictions, or from the
statistical combination of 100 random predictions from each group. Panels: (a) NNLO
$Z^0 \to \ell^+\ell^-$ at the LHC ($\sqrt{s} = 7$ TeV); (b) NNLO $W^+/W^-$ ratio at the LHC ($\sqrt{s} = 7$ TeV);
(c) NNLO+NNLL $t\bar t$ cross sections at the LHC ($\sqrt{s} = 7$ TeV), Top++ (v1.4),
$\mu_R = \mu_F = m_t = 173.18$ GeV; (d) NNLO $gg \to H$ at the LHC ($\sqrt{s} = 8$ TeV) for
$M_H = 126$ GeV, ggh@nnlo (v1.4.1), $\mu_R = \mu_F = M_H/2$. Each panel uses
$\alpha_s(M_Z^2) = 0.1171$ (MSTW08), 0.1180 (CT10) and 0.1190 (NNPDF2.3). Solid lines: envelope
and midpoint. Dashed lines: statistical combination. Open markers: usual best-fit and
68% C.L. Hessian uncertainty. Closed markers: average and s.d. over random predictions.



uncertainty arising from the different $\alpha_s$ values, $\alpha_s(M_Z^2) = 0.118 \pm 0.001$, without
going into the complication of computing explicit PDF+$\alpha_s$ uncertainties on each separate
prediction.

4.2 PDF Constraints From the LHC 

The LHC data have already started bringing in new information on PDFs. This is expected
to be even more the case in the coming years as more processes will be studied thoroughly.
Indeed, as mentioned in Section 2.2.1, full information on the correlations between the
systematic uncertainties is necessary if data are to be used for PDF determination. This
is at present only available for a handful of measurements (in particular those used in
NNPDF2.3, see Figure 1), but this situation is rapidly changing. In fact, it is likely that in
the next one or two decades most information on PDFs, which will be crucial for new physics
searches at the LHC, will be coming from the LHC itself. The only possible exception is
if the proposed Large Hadron-electron Collider (LHeC) [130] were to be built. This would
lead to a very substantial increase in knowledge of high-energy deep-inelastic scattering
which, using separate information on charged- and neutral-current processes, together with
charm tagging, would allow complete flavor separation from DIS alone (as discussed in
Section 2.3), while the large lever arm in $Q^2$ would allow an accurate determination of the
gluon from scaling violations. More accurate results on flavor separation could only be
obtained at a neutrino factory [106].

4.2.1 Light Flavors 

The strongest constraint on light-flavor PDFs at the LHC comes from the combination of
rapidity distributions for the production of various gauge bosons: as repeatedly mentioned,
a rapidity distribution fixes entirely the leading-order parton kinematics. If full information
on the correlation between different processes is retained, a global fit including all of them
effectively uses the information provided by all the various cross-section ratios that are
sensitive to different PDF combinations, such as Equations 4.1 and 4.2, or, equivalently,
the asymmetry

$$\frac{\sigma_{W^+} - \sigma_{W^-}}{\sigma_{W^+} + \sigma_{W^-}} \sim \frac{u(x_1) - d(x_1)}{u(x_1) + d(x_1)}. \qquad (4.3)$$

It is important to observe that many sources of systematics (such as, for instance, the
normalization) are common to these cross sections and cancel in the ratio: hence the
availability of full correlations leads to potentially much more precise results. Given that
the LHC energy is being increased in stages, it is also possible to form ratios or double
ratios between measurements at different energies which further increase the potential for
precision [131].
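
The cancellation of a common normalization uncertainty can be made explicit with a small numerical check. The sketch below takes the charge asymmetry to be $(\sigma_{W^+} - \sigma_{W^-})/(\sigma_{W^+} + \sigma_{W^-})$, the standard definition assumed here, and rescales both cross sections by the same factor, as a fully correlated luminosity error would; the cross-section values are hypothetical.

```python
def charge_asymmetry(sigma_plus, sigma_minus):
    """W charge asymmetry from the W+ and W- cross sections."""
    return (sigma_plus - sigma_minus) / (sigma_plus + sigma_minus)


# Hypothetical cross sections in nb.
sigma_plus, sigma_minus = 6.0, 4.0
nominal = charge_asymmetry(sigma_plus, sigma_minus)

# A fully correlated +-4% normalization shift multiplies both cross sections.
shifts = [
    charge_asymmetry(k * sigma_plus, k * sigma_minus) - nominal
    for k in (0.96, 1.04)
]
print(nominal, shifts)
```

The individual cross sections move by 4% while the asymmetry is unchanged up to rounding, which is why a fit that retains the full correlation information gains precision from such observables.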

The most discriminating data on W production so far are the CMS asymmetry data [132],
included in the NNPDF2.3 fit, which are mostly responsible for the sizable reduction
in uncertainty seen in Figure 6(b) when comparing the NNPDF2.3 and NNPDF2.3noLHC
fits. These measurements have in particular shown that the asymmetry is underestimated
by the MSTW08 fit, implying that $u_v - d_v$ is too small at $x \sim M_W/\sqrt{s} \sim 0.01$. In
Figure 9 we show the CMS data from Reference [132] with $p_T^e > 35$ GeV. It is apparent that
inclusion of these data in the NNPDF2.3 analysis resulted in a more than three-fold
reduction in the NLO PDF uncertainty on the asymmetry at central pseudorapidity. They
prompted a reexamination of the PDF parametrization (and also deuteron corrections)
used in the MSTW08 fit, where an extended Chebyshev parametrization form and more
flexible deuteron corrections were found to automatically improve the description of the
CMS electron asymmetry data [93]; see Figure 9(b). (Figure 9 shows only NLO PDF sets
since the main studies of Reference [93] were done at this order.) This is an example of how
present and future LHC data may help in resolving discrepancies which are still present
between PDF sets.

In addition to W and Z production, measurements of Drell-Yan processes ($Z/\gamma^* \to \ell\ell$
or $W \to \ell\nu$) away from the resonance peaks at $M_{\ell\ell} = M_Z$ or $M_{\ell\nu} = M_W$ in either direction
can provide complementary constraints on different PDF combinations and $x$ values.

4.2.2 Gluon 

The NNPDF2.3 analysis also includes inclusive jet data from ATLAS [133]. These could
in principle result in a reduction of uncertainties on the gluon distribution at the $x$ values
in the intermediate region shown in Figure 1, though in practice their impact is very
moderate, as seen from the luminosity plot in Figure 5 and the Higgs and top production
cross sections in Figure 7. This situation is likely to change in the future, as jet data
become more abundant and precise. For example, the recent CMS jet data [134] are
expected to place more stringent constraints on the gluon distribution than the available
ATLAS data [133].

Figure 9. Description of CMS electron asymmetry [132] using (a) various NLO PDF sets, and
(b) improved variants of the MSTW08 analysis [93].

As is clear from Figure 7, data for top-pair production, especially differential distri- 
butions [135], will provide stringent constraints on the gluon. The process is only known 
at NLO at the differential level, but at the inclusive level the full NNLO result is just 
around the corner [124]. Eventually, once all aspects of the observed Higgs-like boson are 
understood, even Higgs production itself (which at least in the Standard Model is known 
up to NNLO) could perhaps be the most sensitive probe of the gluon distribution.

Other LHC data may be used to constrain the gluon distribution. A particularly clean
probe might come from gauge-boson $p_T$ distributions, because the final-state W or Z can
acquire a $p_T$ dependence only if at least one parton is radiated. The values of $p_T$ accessible
at the LHC are large enough that it should be possible to obtain significant constraints
even by imposing a cut in order to avoid the low $p_T$ region where QCD resummation is
mandatory and non-perturbative effects might become relevant. The impact on PDFs of
these data, some of which are already available [136, 137], has not been studied yet.
However, as with the closely related observable of $W,Z$+jet production, only a NLO
($\mathcal{O}(\alpha_s^2)$) calculation is available.

Another classic process which may be used to constrain the gluon distribution is prompt
photon production, and the related photon+jet production process, to which gluons
contribute at leading order. Inclusion of present-day LHC prompt photon data would only
lead to a moderate reduction, of order 20%, of the gluon uncertainty [138] at medium-small
$x \sim 0.02$, with photon+jet data having an even milder impact [139], but future data
are likely to be much more constraining. Here, the main bottleneck is that theoretical
predictions are only available up to NLO, and higher-order corrections are likely to be
large. Diphoton production is known up to NNLO [140], but it is less constraining in that
the gluon contribution only starts at NLO.

A potential option to get a handle on the gluon is the study of W polarization [141-143],
which is similar to the W $p_T$ distribution discussed above, but subject to small QCD
corrections [144].

4.2.3 Strangeness and Heavy Quarks 

As already discussed in Section 2.3.3, W production also provides a handle on strangeness, 
and in fact Tevatron W production data were found [31] to have a significant impact on 
the $s - \bar{s}$ distribution. It has been argued [145] that a fit only to inclusive W and Z
differential cross sections [118], combined with inclusive DIS data from HERA, can significantly
constrain the strange content of the proton, finding apparently no strange suppression,
contrary to previous determinations from CCFR/NuTeV dimuon cross sections ($\nu N \to \mu\mu X$).
However, these conclusions are weakened in a similar NNPDF study using a more flexible
parametrization form [111], or alternatively by incorporating a suitable tolerance [98],
and it seems that present-day LHC W production data only have a very minor impact on
strangeness, but this is again likely to change in the near future.

A much more direct handle on strangeness at the LHC comes from W production with 
an associated charm-tagged jet, where the dominant partonic subprocesses are
$\bar{s} g \to W^+ \bar{c}$ and $s g \to W^- c$. A first preliminary measurement has been made
by CMS [146] of the cross-section ratios $R_c = \sigma(W + c)/\sigma(W + \mathrm{jets})$, probing
the strange content of the proton relative to other light-quark flavors, and
$R_c^\pm = \sigma(W^+ + \bar{c})/\sigma(W^- + c)$, potentially probing
the strange asymmetry. With more precise measurements to come, including differential
distributions, the W+charm process should enable powerful constraints to be made on the
$s$ and $\bar{s}$ distributions [147]. The main limitation here is that only NLO results are available
for this process. 
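
As a purely illustrative aside, the arithmetic behind cross-section ratios of the kind described above can be sketched as follows. The function name and all numerical inputs are hypothetical placeholders, not measured values, and the error propagation assumes uncorrelated uncertainties on numerator and denominator. In a real measurement many systematic uncertainties are correlated and partly cancel in the ratio, which is one reason ratios are attractive observables.

```python
import math


def ratio_with_uncertainty(num, num_err, den, den_err):
    """Return (ratio, uncertainty) for num/den.

    Assumes uncorrelated uncertainties, so the relative errors
    are added in quadrature (a simplifying assumption).
    """
    if num <= 0 or den <= 0:
        raise ValueError("cross sections must be positive")
    ratio = num / den
    rel_err = math.hypot(num_err / num, den_err / den)
    return ratio, ratio * rel_err


# Hypothetical cross sections in arbitrary units (placeholders only).
sigma_w_plus_c = (48.0, 4.0)    # W+ with a charm-tagged jet
sigma_w_minus_c = (50.0, 4.0)   # W- with a charm-tagged jet

# Charge ratio: a value different from 1 would hint at a strange asymmetry,
# provided the deviation is significant compared with the uncertainty.
r_pm, r_pm_err = ratio_with_uncertainty(*sigma_w_plus_c, *sigma_w_minus_c)
print(f"charge ratio = {r_pm:.3f} +/- {r_pm_err:.3f}")
```

With placeholder inputs like these the ratio is compatible with unity within its uncertainty, which illustrates why more precise data are needed before such ratios constrain the strange asymmetry.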

Similarly, Z production in association with a tagged charm jet will provide significant 
constraints on the charm distribution [147], and likewise, Z with a b jet on the b distribution.

5 THE FUTURE OF PDF DETERMINATION 

Parton distributions have become increasingly relevant with the advent of the LHC. Af- 
ter the landmark discovery of a Higgs-like boson in July 2012 [1, 2], focus now shifts to 
characterizing the properties of this new particle, as well as increasingly difficult searches 
for indications of other new physics. As a consequence, demand will grow to reduce the 
unavoidable uncertainties associated with the PDFs in calculations of both signal and 
background. On the one hand, it will be necessary to bring under complete control the 
uncertainties in the region of electroweak symmetry breaking, i.e., the region of x and
$Q^2$ which is probed by Higgs production. In this region, as mentioned in Section 3.2,
uncertainties are in principle small, but in practice for some PDFs such as the gluon and even
more the strange, discrepancies between different sets at the one or two sigma level are 
seen and not fully understood. On the other hand, searches for new physics will involve 
heavy final states, and thus, because of Equation 2.3, they will involve knowledge of PDFs 
in the large x > 0.5 region where they are currently very poorly known. To achieve these 
goals, it will be necessary to construct PDFs that satisfy a number of criteria which, not so
long ago [48], characterized an ideal PDF determination, but in the LHC era have become
necessary requirements, namely, in decreasing order of importance: 

1. The range and precision of data sets must be as wide as possible, cover currently 
unexplored kinematic regions, and include new LHC processes, such as those dis- 
cussed in Section 4.2, which will gradually remove current discrepancies between 
PDF determinations. 

2. The parametrization should be sufficiently general and demonstrably unbiased, 
either by using a sufficiently large number of parameters, or by careful a posteriori 
checks of parametrization independence. 

3. The experimental uncertainties should be understood and carefully propagated, 
and in particular the statistical meaning of the procedures that are being adopted 
to determine the PDF uncertainties should be understood: specifically, the choice
of tolerance, and the determination of the optimal fit when using a very flexible 
parametrization. 

4. Computations should be performed at the highest available perturbative order, 
and in particular, at the order which is subsequently to be used in the computation 
of partonic cross sections. This is currently NNLO, but the inclusion
of various kinds of all-orders resummation is becoming increasingly important. For
example, computations which include Sudakov resummation are already being used 
in the computation of the Higgs production cross section [148], and for consistency 
resummed computations should also be used in PDF determination [149]: this be- 
comes especially important in the large x > 0.5 region that will be probed in searches 
for new physics. Also, the inclusion of resummation effects will be increasingly im- 
portant in the construction of PDFs to be used in fixed-order calculations matched to 
parton showers in Monte Carlo event generators. Finally, PDFs including electroweak 
corrections will have to be constructed [83]. 

5. The treatment of heavy quarks will have to include mass-suppressed terms in the 
coefficient functions, while also resumming logarithmically enhanced terms via the 
evolution equations. Such a treatment, like the schemes discussed in Section 2.1.3, 
is a minimum requirement: this is currently standard for DIS, but applications to 
hadronic observables are so far limited. Also, the dependence of results on the choice 
of value for the heavy-quark masses will have to be studied more systematically, 
possibly using the perturbatively more stable $\overline{\mathrm{MS}}$ mass definition [150], with PDF
sets made available for several values of the heavy-quark masses. 

6. The strong coupling $\alpha_s$, in addition to being determined simultaneously with
PDFs, should also be decoupled from the PDF determination, with PDF sets available
for a range of fixed $\alpha_s$ values, and full PDF uncertainty determination for each
value of $\alpha_s$.

7. Estimates of theoretical uncertainties will have to be performed together with
PDF sets, and such uncertainties will have to be provided each time they become 
comparable with other sources of PDF uncertainty. This is presently an almost 
unexplored territory. 
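
To make concrete why heavy final states probe the large-x region, as noted earlier in this section for new-physics searches, the following minimal sketch uses standard leading-order kinematics, x1 x2 = M^2/s with x1,2 = (M/sqrt(s)) exp(±y). It is assumed, not confirmed here, that this is the content of the kinematic relation cited in the text. The function name, the masses, and the collider energy are illustrative choices only.

```python
import math


def momentum_fractions(mass_gev, sqrt_s_gev, rapidity=0.0):
    """Leading-order parton momentum fractions (x1, x2) needed to produce
    a final state of invariant mass `mass_gev` at the given rapidity in a
    hadron-hadron collision with centre-of-mass energy `sqrt_s_gev`."""
    if mass_gev <= 0 or sqrt_s_gev <= 0:
        raise ValueError("mass and collider energy must be positive")
    sqrt_tau = mass_gev / sqrt_s_gev
    x1 = sqrt_tau * math.exp(rapidity)
    x2 = sqrt_tau * math.exp(-rapidity)
    if x1 > 1.0 or x2 > 1.0:
        raise ValueError("kinematically forbidden: x would exceed 1")
    return x1, x2


# Illustrative (assumed) inputs: an 8 TeV collider.
SQRT_S = 8000.0

# A 125 GeV state produced centrally needs x of order 0.01 to 0.02.
x_light = momentum_fractions(125.0, SQRT_S)

# A hypothetical 4 TeV state produced centrally needs x1 = x2 = 0.5.
x_heavy = momentum_fractions(4000.0, SQRT_S)

print("light state:", x_light)
print("heavy state:", x_heavy)
```

The product x1 x2 is fixed by M^2/s regardless of rapidity, so raising the mass of the final state at fixed collider energy necessarily pushes at least one parton towards large x.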

We predict that, as the needs of precision physics at the LHC develop, more and more 
of these features will become accepted standards. 